Background of the Invention 

The invention pertains to wireless communications and, more particularly, to 
communications computers. The invention has application, by way of non-limiting example, in 
improving the capacity of cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g.,
multiple cellular phone users in the same geographic area using their phones at the same time.
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels, driving service
quality below acceptable levels, when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al., "Multiuser Detection for
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
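
The principle that signals from several users can be used jointly may be illustrated with a small sketch. It uses the textbook decorrelating detector for two synchronous users, which is one of the classical MUD forms covered by the cited surveys and is not necessarily the method of the invention. The cross-correlation `rho`, the amplitudes, and the bit values are assumed purely for illustration, and noise is omitted.

```python
def sign(x):
    """Hard decision: map a real value to a +1 / -1 bit."""
    return 1 if x >= 0 else -1


def matched_filter_outputs(rho, amps, bits):
    """Noise-free matched-filter outputs y = R A b for two synchronous users,
    where R = [[1, rho], [rho, 1]] is the code cross-correlation matrix."""
    a1b1 = amps[0] * bits[0]
    a2b2 = amps[1] * bits[1]
    return (a1b1 + rho * a2b2, rho * a1b1 + a2b2)


def conventional_detect(y):
    """Single-user detection: each user's bit is decided from its own output only."""
    return (sign(y[0]), sign(y[1]))


def decorrelating_detect(rho, y):
    """Joint detection: apply the inverse of R to both outputs, then decide."""
    det = 1.0 - rho * rho
    z1 = (y[0] - rho * y[1]) / det
    z2 = (y[1] - rho * y[0]) / det
    return (sign(z1), sign(z2))


if __name__ == "__main__":
    rho = 0.5             # assumed cross-correlation between the two codes
    amps = (1.0, 10.0)    # user 2 is much stronger: a near/far situation
    bits = (1, -1)        # bits actually transmitted
    y = matched_filter_outputs(rho, amps, bits)
    print(conventional_detect(y))        # (-1, -1): the weak user's bit is wrong
    print(decorrelating_detect(rho, y))  # (1, -1): both bits recovered
```

In this assumed scenario the strong user swamps the weak user's matched-filter output, while the joint decision recovers both bits. The matrix inversion also hints at why such techniques become computationally heavy as the number of users grows.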
An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 




A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 






EV 093 931 797 US 



Summary of the Invention 

These and other objects are met by the invention which provides, in one aspect, a 
communications computer, referred to as the "MCW-1" (among other terms) in the materials that 
follow, and methods of operation thereof. An overview of that system is provided in the section 
entitled "Communications Computer," beginning on page 5 hereof. A more complete 
understanding of its implementation may be attained by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to, the
following:




architecture and operation of a communications computer for a wireless 
communications system, including a fully programmable computer inserted 
into base transceiver station (BTS) to support compute-intensive and/or highly 
data-dependent functions such as adaptive processing and interference 
cancellation 







These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
Detailed Description of the Invention 

See the attached materials on pages 5-11 hereof, providing description and block 
diagram of a preferred structure and operation of a communications computer for wireless 
applications according to the invention. 

The aforementioned materials pertain to improvements on the methods and apparatus 
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001, 
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS and 
United States Provisional Application Serial No. 60/289,600, filed May 7, 2001, entitled 
IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS USING
LONG-CODE MULTI-USER DETECTION, the teachings of both of which are incorporated herein by
reference and copies of at least portions of which are attached hereto. Those copies bear the
U.S. Postal Service Express Mail label number of both prior filings, as well as that of this filing
(the latter being referred to as the "New Exp. Mail Label No.").
Background of the Invention 

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 



Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal
time-slots, respectively.

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels — driving service 
quality below acceptable levels — when there are too many users. 

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al., "Multiuser Detection for
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be
cost-effectively implemented and as require minimal changes in existing wireless communications
infrastructure. 



A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time.

A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 
Summary of the Invention 



These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the 
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 



In view of those materials, aspects of the invention include, but are not limited to, the 
following: 

• methods and apparatus for long-code multi-user detection (MUD) in a wireless
communications system.

These and other aspects of the invention (including utilization of the aforementioned
methods and aspects for other than wireless communications and/or interference cancellation)
are evident in the materials that follow.
Detailed Description of the Invention

See the attached materials on pages 5-12 hereof, providing a block diagram of a
preferred algorithm for long code MUD which includes identification of (roughly) how many
GOPS are involved in each major function; a diagram showing interfaces between a long code
MUD processing card according to the invention and a modem, e.g., of the type provided by
Motorola (or another supplier of such components); and two block diagrams of the same
BASELINE 0 board hardware architecture at a top level identifying the processing nodes. The
attached diagram entitled "Long-code Mapping to Hardware" illustrates support of 64 users for
long code MUD and shows parts of the long code MUD algorithm supported by each processing
node. The diagram entitled "Short-code Mapping to Hardware" illustrates support of 128 users
for short code MUD and shows parts of the short code MUD algorithm that would be supported by
each processing node.
The aforementioned materials pertain to improvements on the methods and apparatus
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001,
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS, the
teachings of which are incorporated herein by reference and a copy of which is attached hereto.
That copy bears the U.S. Postal Service Express Mail label number of the original filing, as
well as that of this filing (the latter being referred to as the "New Exp. Mail Label No.").
Background of the Invention 

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 



Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 






A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g.,
multiple cellular phone users in the same geographic area using their phones at the same time.
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels — driving service 
quality below acceptable levels — when there are too many users. 






A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al., "Multiuser Detection for
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 






An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 



A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 



A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 
Summary of the Invention 



These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the 
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 
In view of those materials, aspects of the invention include, but are not limited to, the 
following: 



• hardware and/or software architectures (and methods of operation thereof) for
multi-user detection in wireless communications systems and particularly, for
example, in a wireless communications base station;

• a hardware architecture (and methods of operation thereof) for multi-user
detection in wireless communications systems pairing each processing node with
NVRAM and watchdog PLD for fault management;

• methods and apparatus for connecting watchdog PLDs with an out-of-band
fault-management bus;

• methods and apparatus for use of an embedded host with the RACEway™
architecture of Mercury Computer Systems, Inc.;

• methods and apparatus for interfacing a digital signal processor to the
RACEway™ architecture;
• methods and apparatus for interfacing the RACEway™ architecture to a
programming port in a device for multi-user detection in wireless communications
systems;

• methods and apparatus for implementing a DMA Engine FPGA for use in
multi-user detection in a wireless communications system;

• methods and apparatus for implementing a hardware-based reset voter and stop
voter;

• methods and apparatus for scalable mapping of handset and BTS functions to
multiple processors;

• methods and apparatus for facilitating allocation and management of buffers for
interconnecting processors that implement the aforementioned mapping;

• methods and apparatus for implementing a hybrid operating system, e.g., with the
VxWorks operating system (of WindRiver Systems, Inc.) on a host computer and
the MC/OS operating system on RACE®-based nodes. (Race and MC/OS are
trademarks of Mercury Computer Systems, Inc.);

• methods and apparatus for high-availability multi-user detection in wireless
communications systems, including (by way of non-limiting example) round-robin
fault testing, use of NVRAM to store fault symptoms, and use of a master
to diagnose faults from NVRAM contents;

• class library-based methods and apparatus for facilitating interprocessor
communications, by way of non-limiting example, in buffering for multi-user
detection in wireless communications systems;
• methods and apparatus for implementation of R-matrix, gamma-matrix and MPIC 
computations on separate processors in a device for multi-user detection in 
wireless communications systems; 

• methods and apparatus for computing complementary R-matrix elements in 
parallel using multiple processors in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for depositing results of R-matrix calculations 
contiguously in memory in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for increasing the number of MPIC and R-matrix
calculations performed in cache in a device for multi-user detection in wireless
communications systems;

• methods and apparatus for performing a gamma-matrix calculation in FPGA in a
device for multi-user detection in wireless communications systems;



• methods and apparatus for equalizing load of R-matrix-element calculation 
among multiple processors in a device for multi-user detection in wireless 
communications systems; and 

• methods and apparatus for use of AltiVec registers and instruction set in
performing MUD calculations in a wireless communications system.

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
Detailed Description of the Invention

(see attached materials)
Software Architecture of the MCW-1 MUD Board

1 Purpose

The purpose of this document is to describe the software architecture of the
MCW-1 board. The MCW-1 application is a digital signal processing application
that performs interference cancellation for a cellular base station modem board.
The software project consists of 3 major parts:

• Support for the custom MCW-1 board being designed by the Wireless
Communications Group hardware department. This consists of porting the
existing host (VxWorks) and multicomputer (MC/OS) software to the board,
and adding code to support specialized features of the board such as LED
control, voltage monitoring, hardware watchdogs, etc.

• Increasing the MTBF of the system by addition of high availability software.
This software includes monitoring features such as watchdogs, fault
detection/repair algorithms, and remote software download.

• Implementation of the application software. This includes optimal
implementation of the MUD algorithms, as well as implementing degraded
versions of the algorithm that can be executed when some of the
computational hardware is unavailable due to failures.

Detailed information on the design of new software for the MCW-1 board can
be found in the appropriate functional design documents, which are listed in the
References section of this document.



2 Glossary

1. MTBF - Mean Time Between Failures
2. MUD - Multi User Detection. A class of algorithms to detect multiple
interference sources and remove those effects from the signal.
3. Multicomputer - a parallel computer which achieves its increase in performance
by having more than one CPU working on the application simultaneously.
4. VxWorks - a proprietary real time operating system sold by Wind River, Inc.

3 Application Execution Environment

3.1 Overview

The purpose of the MUD application is to input raw antenna data from the base
station modem card, detect sources of interference, produce a new stream of data
which has had interference removed, and then output the data to the modem card
for further processing.

Characteristics of this processing are that it must have low latency (< 300
microseconds), must deal with large amounts of data (> 110 million bytes of
data per second), and must be very reliable.

The Mercury computer system is well suited to this kind of signal processing,
exhibiting both very low latencies and high bandwidths.
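
As a rough illustration of what the latency and throughput figures imply together, the following sketch computes how much input arrives during one latency window. The two constants come from the text; the function name and the idea of sizing buffers from this product are assumptions made for illustration only.

```python
# Figures taken from the text. Both are bounds, so the product is the amount of
# data arriving in one latency window at the lowest stated input rate.
MIN_INPUT_RATE_BYTES_PER_S = 110_000_000  # "> 110 million bytes of data per second"
MAX_LATENCY_US = 300                      # "< 300 microseconds"


def bytes_per_window(rate_bytes_per_s: int, window_us: int) -> int:
    """Bytes arriving during a window of window_us microseconds (integer math)."""
    return rate_bytes_per_s * window_us // 1_000_000


if __name__ == "__main__":
    print(bytes_per_window(MIN_INPUT_RATE_BYTES_PER_S, MAX_LATENCY_US))  # 33000
```

At the stated rate, roughly 33,000 bytes arrive within a single 300 microsecond window, which gives a feel for how much data is in flight at any instant.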
The system hardware and software were not designed with high availability as
a goal, so reliability is in line with other standard computer systems designed for
commercial applications.

Input data flows from the Modem Motherboard, over the PCI bus, through the
PXB++ bridge, onto the fabric, through the crossbar, and into the memory of the
computing elements. Output data flows in the opposite direction. Some data will
also flow between the 8240 Host CPU and the compute elements, via a similar
pathway, i.e. from the PCI bus through the PXB++ and thus onto the fabric.

Although the software tries to treat the system as if the hardware were
symmetric, as can be seen in the following figure, the host 8240 CPU is attached
via the PCI bus, not directly to the fabric.

Figure 1 (diagram not reproduced)

3.2 Operating System

MC/OS was selected as the operating system for the MCW-1 board because it
provides the low latencies and high I/O and IPC bandwidths required for these
sorts of algorithms, and also because it already provides support for most of the
hardware being incorporated on the MCW-1 board.

The MUD application can be kept as portable as possible by minimizing the
use of non-POSIX MC/OS system calls, and encapsulating calls into proprietary
MC/OS interfaces such as DX.

MC/OS requires the presence of a host computer system, which in this case
will be a Motorola 8240 PowerPC processor running the VxWorks operating
system.
3.3 IPC 



The MC/OS DX subsystem will be used for IPC within the application. This 
API provides low overhead, low latency access to the Mercury DMA engines, 
which in turn provide high bandwidth transfers of data. DX will be used to move 
data between the G4 compute elements during parallel processing, and also will 
be used to move data between the MC/OS compute elements, the VxWorks host 
computer, and the motherboard modem card. 



3.4 I/O 



Input / Output between the MUD card and the motherboard modem card takes 
place by moving data between the Race++ Fabric and the PCI bus via the PXB++ 
bridge. The application will use DX to initialize the PXB++ bridge, and to cause 
input/output data to move as if it were regular DX IPC traffic. 

Discussions with the customer need to take place in order to determine exactly 
how data flows over the PCI bus. For instance, it is currently unclear who will 
initiate data transfers, and how the initiator will know which PCI addresses should 
be involved in the transfer. A number of meetings with the customer are required 
to resolve these issues. 
3.5 High Availability

The approach to high availability on the MCW-1 card is to do most of the high
availability processing at a time when the application is not running. Specifically,
faults are handled by rebooting the system (fairly quickly). When the system
comes up, the application can determine which processing resources are available,
and it is up to the application to determine how to map its processing needs onto
the available resources.

This approach to high availability means that there are short interruptions in
service, but that the application does not need to know how to continue execution
across faults. For instance, the application can make the assumption that the
hardware configuration will not change without the system first rebooting.

If the application has state which needs to be preserved across reboots, the
application is responsible for checkpointing the data on a regular basis. The
system software will provide an API to a portion of the non-volatile RAM for this
purpose. It should be noted that the non-volatile RAM is quite small, and that
storage of more than a few hundred bytes of data will require another mechanism
to be put in place.
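
The checkpointing idea can be sketched as follows. The text does not describe the NVRAM API, so everything here is an assumption made for illustration: the 256-byte region (modelled as a `bytearray`), the record layout, the `CKPT` magic value, and the function names. The sketch only shows one way a small, integrity-checked record could be written regularly and validated after a reboot.

```python
import struct
import zlib

NVRAM_SIZE = 256                  # assumed size of the application's NVRAM portion
MAGIC = b"CKPT"                   # hypothetical marker identifying a valid record
HEADER = struct.Struct("<4sIH")   # magic, sequence number, payload length


def write_checkpoint(nvram: bytearray, seq: int, payload: bytes) -> None:
    """Store header + payload + CRC32 at the start of the NVRAM region."""
    record = HEADER.pack(MAGIC, seq, len(payload)) + payload
    record += struct.pack("<I", zlib.crc32(record))
    if len(record) > len(nvram):
        raise ValueError("checkpoint does not fit in the NVRAM region")
    nvram[:len(record)] = record


def read_checkpoint(nvram: bytearray):
    """Return (seq, payload) if a valid record is present, else None."""
    magic, seq, length = HEADER.unpack_from(nvram, 0)
    end = HEADER.size + length
    if magic != MAGIC or end + 4 > len(nvram):
        return None
    (stored_crc,) = struct.unpack_from("<I", nvram, end)
    if zlib.crc32(bytes(nvram[:end])) != stored_crc:
        return None               # torn or corrupted write: ignore the record
    return seq, bytes(nvram[HEADER.size:end])


if __name__ == "__main__":
    nvram = bytearray(NVRAM_SIZE)
    write_checkpoint(nvram, 7, b"active_users=64")
    print(read_checkpoint(nvram))
```

The size check reflects the constraint noted above: a record larger than the region is rejected, so larger state would need a different storage mechanism.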

4 Operating System Environment

4.1 Overview

Mercury Computer Systems, Inc. has historically had the concept of a host
computer system. This dates back to the days when Mercury produced array
processors that were attached to customers' mainframe computers. The evolution
of Mercury multicomputers has left a vestigial host that often performs little more
service than as a bootstrap device for the multicomputer.

The host computer system survives in the MCW-1 design primarily as a way to
reduce schedule risk. The existence of a host computer system is assumed in so
many ways by the existing Mercury software, that it would add significant
schedule risk to attempt to remove this assumption in the MCW-1 timeframe.

In the MCW-1 board, the host system performs the following functions:

• It configures the Compute Elements, Fabric, and Bridges
• It loads executable code into the Compute Elements
• It serves as a bridge to the TCP/IP internetwork
• It serves as a file system daemon
• It runs some of the application software
• It manages some of the specialized high availability hardware

4.2 Bootstrap

The host computer system is based on a Motorola 8240 PowerPC processor on
the MCW-1 board. The 8240 is attached to an amount of linear flash memory.
This flash memory serves several purposes.

The first purpose the flash memory serves is as a source of instructions to
execute when the 8240 comes out of reset. Linear flash is flash which can be
addressed as if it was normal RAM. Flash memories can also be organized to look
like disk controllers; however in that configuration they require a disk driver to
provide access to the flash memory. Although such an organization has several
benefits such as automatic reallocation of bad flash cells, and write wear leveling,
it is not appropriate for initial bootstrap.

The flash memory also serves as a file system for the host (see Section 4.6), 
and as a place to store board permanent information (such as a serial number). 
Refer to the function design specification (TBS) for more details on how flash 
memory is used. 

When the 8240 first comes out of reset, memory is not turned on. Since high 
level languages such as C assume some memory is present (for a stack, for 
instance), the initial bootstrap code must be coded in assembler. This assembler 
bootstrap should only be a few hundred lines of code, sufficient to configure the 
memory controller, initialize memory, and initialize the configuration of the 8240 
internal registers. 

After the assembler bootstrap has finished execution, control is passed to the 
MCW-1 H.A. code (which is also contained in boot flash memory). The purpose 
of the H.A. code is to attempt to configure the fabric, and load the compute 
element CPUs with H.A. code. Once this is complete, all the processors 
participate in the H.A. algorithm. The output of the algorithm is a configuration 
table which details which hardware is operational and which hardware is not. This 
is an input to the next stage of bootstrap, the Multicomputer Configuration. 

4.3 Multicomputer Configuration 

MC/OS expects the host computer system to configure the multicomputer. The 
configmc program reads a textual description of the computer system 
configuration, and produces a series of binary data structures that describe the 
computer system configuration. These data structures are used in MC/OS to 
describe the routing and configuration of the multicomputer. 

The MCW-1 board will use almost exactly the same sequence to configure the 
multicomputer. The major difference is that MC/OS expects configurations to be 
totally static, whereas the MCW-1 configuration will need to change dynamically 
as faulty hardware causes various resources to be unavailable for use. 

There are currently two proposals being considered for how this dynamic 
reconfiguration takes place. 

The first proposal is that the binary data structures produced by configmc are 
modified to include flags that indicate whether a piece of hardware is usable or 
not. A modification to MC/OS would prevent it from using hardware marked as 
broken. The risk here is that the modifications to MC/OS may be non-trivial. The 
benefit may be faster reboot times. 

The second proposal is that the output of the H.A. algorithm is used to produce 
a new configuration file input to configmc, the configmc execution is repeated 
with the new file, and MC/OS is configured and loaded with no knowledge of the 
broken hardware whatsoever. This proposal has the added benefit that configmc 
may be able to calculate optimal routing tables in the face of failed
hardware, minimizing the performance impact of the failure on the remaining
components. This proposal provides risk reduction given that MC/OS changes
would not be required. 
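
The second proposal can be sketched as a simple filtering step between the H.A. algorithm and configmc. The actual configmc input format and the layout of the H.A. configuration table are not given in the text, so the line-oriented `element <name> <type>` syntax, the `health` dictionary, and the function name below are all hypothetical.

```python
def regenerate_config(config_text: str, health: dict) -> str:
    """Return a copy of a (hypothetical) textual configuration with every
    'element <name> ...' line removed when <name> is not marked operational.

    health maps element names to True (operational) or False (failed);
    elements missing from the table are treated as failed.
    """
    kept = []
    for line in config_text.splitlines():
        fields = line.split()
        is_element = len(fields) >= 2 and fields[0] == "element"
        if is_element and not health.get(fields[1], False):
            continue  # broken hardware never reaches the configuration tool
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    original = "element host 8240\nelement ce0 g4\nelement ce1 g4\n"
    ha_table = {"host": True, "ce0": True, "ce1": False}
    print(regenerate_config(original, ha_table), end="")
```

Under this approach the downstream tools see a smaller but otherwise ordinary static configuration, which is why no operating system changes would be needed.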
4.4 Multicomputer Loading 

After the host computer has configured the multicomputer, the runmc program 
loads the functional compute elements with a copy of MC/OS. The only change
required for the MCW-1 board is for the loading process to examine which
hardware may be offline because it is faulty, and take this into account when
determining which compute elements need to be loaded. 

4.5 TCP/IP Bridge 

We believe that the customer is likely to require access to the MCW-1 board 
from a TCP/IP network. MC/OS nodes do not contain a TCP/IP stack; therefore 
the host computer system acts as a connection to the TCP/IP network. The 
VxWorks operating system contains a fully functional TCP/IP stack. All currently 
envisioned daemons that need access to the TCP/IP network will run on the host 
processor. Should the need arise for compute elements to access network 
resources, the host computer would have to act as a proxy, exchanging 
information with the compute element utilizing DX transfers, and then making the 
appropriate TCP/IP calls on behalf of the compute element. 

4.6 File System 

The host computer system needs a file system to store configuration files, 
executable programs, and MC/OS images. Rotating disks have insufficient MTBF; 
therefore flash memory will be utilized. Rather than have a separate flash 
memory from the host computer boot flash, the same flash is utilized for both 
bootstrap purposes and for holding file system data. A commercial flash file 
system will be purchased and ported which provides DOS file system semantics 
as well as write wear leveling. Wear leveling attempts to spread the number of 
writes evenly across the sectors of flash memory, as flash memory can only be 
written a finite number of times before it is worn out. Modern flash devices can be 
written around 100,000 times before they are worn out. 
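
The wear-leveling idea can be illustrated with a minimal sketch. It is not the algorithm of the commercial flash file system, which is not described here; the class name, the "least-worn sector first" policy, and the per-sector counters are assumptions made for illustration. Only the 100,000-write figure comes from the text.

```python
class WearLeveler:
    """Toy wear leveler: always write to the least-worn usable sector."""

    def __init__(self, sectors, max_writes=100_000):
        self.write_counts = [0] * sectors   # writes seen by each sector
        self.max_writes = max_writes        # endurance limit per sector

    def pick_sector(self):
        # Only sectors below the endurance limit are candidates.
        usable = [i for i, n in enumerate(self.write_counts) if n < self.max_writes]
        if not usable:
            raise RuntimeError("all sectors are worn out")
        # The least-written sector wins; ties go to the lowest index.
        return min(usable, key=lambda i: self.write_counts[i])

    def write(self):
        sector = self.pick_sector()
        self.write_counts[sector] += 1
        return sector
```

Under this policy, repeated writes rotate through the sectors, so no single sector reaches its limit long before the others.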

4.7 Remote Software Upgrade 

The current design of the MCW-1 board assumes that the customer will want 
to update system and application code in the field, via network. There are two 
portions of code which need to be updated - the bootstrap code which is executed 
by the 8240 processor when it comes out of reset, and the rest of the code which 
resides on the flash file system as files. 

When code is initially downloaded to the MCW-1, it is written as a group of 
files within a directory in the flash file system. A single top level file keeps track 
of which directory tree is used to boot the system. This file continues to point at 
the existing directory tree until a download of new software is successfully 
completed. When a download has been completed and verified, the top-level file 
is updated to point to the new directory tree, the boot flash is rewritten, and the 
system can be rebooted. 
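
The pointer-switch step can be sketched as below. This is a hedged illustration only: the pointer file name `boot_tree`, the `verify` callback, and the use of an atomic rename are assumptions, and the sketch omits rewriting the boot flash and the reboot itself.

```python
import os
import tempfile


def activate(root, new_tree, verify):
    """Point the top-level file at new_tree, but only after verification.

    root: directory that holds the release trees and the pointer file.
    new_tree: name of the freshly downloaded directory tree.
    verify: callable that returns True when the tree at the given path is good.
    """
    if not verify(os.path.join(root, new_tree)):
        return False                      # pointer still names the old tree
    pointer = os.path.join(root, "boot_tree")   # hypothetical file name
    tmp = pointer + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_tree)
    os.replace(tmp, pointer)              # swap the pointer in one step
    return True


def active_tree(root):
    """Read which directory tree the system would boot from."""
    with open(os.path.join(root, "boot_tree")) as f:
        return f.read()
```

Because the pointer is replaced only after verification, an interrupted or corrupt download leaves the existing tree selected.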

A possible problem in multi-board systems is how to deal with different 
versions of released software on different boards. For instance, if board 1 has 
revision 1.0 of the software distribution, and board 2 has revision 1.1 of the 
software distribution, will the two versions work together, or will there be a way 
to ensure that the same version of software is installed on all boards? This issue 
does not occur on the MCW-1 because it is a single board solution; therefore this 
issue can be addressed at a later time. 

A commercial solution to remote software upgrade is available, and has been 
ported to VxWorks. It is our intent to port this code at a future date. 

5 High Availability 

5.1 Goals 

The goal of the high availability features of the MCW-1 is to increase the 
MTBF of the system as much as possible with little or no increase in cost to the 
board. The requirement for minimal cost increase rules out such common 
approaches as hot or cold standby, replicated hardware, etc. 

It is not a goal to provide uninterrupted computing during hardware or software 
failures, nor is it a goal to provide fault tolerance. 

5.2 Fault Detection & Isolation 

Fault detection is performed by having each CPU in the system gather as much 
information as possible about what it observed during a fault, and then comparing 
the information in order to detect which components could be the common cause of 
the symptoms. In some cases, it may take multiple faults before the algorithm can 
detect which component is at fault. The requirement not to add expensive 
hardware for fault detection means that in many cases the algorithm will not be 
able to determine which component is at fault. 
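
One way to picture the "common cause" comparison is as a set intersection over the reports from each CPU. This is a sketch under assumptions, not the H.A. algorithm itself: it supposes each report can be reduced to the set of components that could explain what that CPU observed, and the component names are hypothetical.

```python
def isolate(reports):
    """Return the components consistent with every fault report.

    reports: iterable of sets, each holding the components that could
    explain one CPU's observation (hypothetical representation).
    """
    suspects = None
    for report in reports:
        current = set(report)
        # Narrow the suspect list to components named by every report so far.
        suspects = current if suspects is None else suspects & current
    return suspects or set()
```

A single report usually leaves several suspects, which matches the remark that multiple faults may be needed before one component can be singled out, and that in many cases no single component can be determined.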

The MCW-1 board has many single points of failure. Specifically, everything 
on the board is a single point of failure except for the compute elements. This 
means that the only hard failures that can be configured out are failures in the 
compute elements. However, many failures are transient or soft, and these can be 
recovered from with a reboot cycle. Therefore, we expect the high availability 
features to have a positive effect on the MTBF of the card. 

More detailed information is available in the functional design specification 
(reference 1). 

5.3 Degraded Application 

In the case of hard failures of a compute element, the application will have to 
execute with reduced demand for computing resources. There are several 
strategies possible for the MUD algorithm to decrease computing demands, such 
as working with a smaller number of interference sources, or performing a less 
complete job of interference cancellation. 

We expect the computing requirements of the algorithm to be high enough that 
failure of more than a single compute element will cause the board to be 
inoperative. Therefore, the MCW-1 application only needs to handle two 
configurations: all compute elements functional and 1 compute element 
unavailable. We believe that a small amount of startup code can map the 
application onto the two possible configurations. Note that the single crossbar 
means that there are no issues as to which processes need to go on which 
processors: the bandwidth and latencies for any node to any other node are 
identical on the MCW-1. This will not be true of larger systems in the future, and 
we will eventually need a way to map computing and I/O requirements onto 
arbitrary hardware configurations. 
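
The startup mapping onto the two supported configurations might look like the sketch below. The function name, the mode labels, and the return structure are hypothetical; only the rule (all elements, or exactly one missing, otherwise inoperative) comes from the text.

```python
def plan_application(healthy, total=4):
    """Choose how the application runs for a given set of healthy elements.

    Returns None when more than one compute element is unavailable,
    because the board is then expected to be inoperative.
    """
    count = len(healthy)
    if count == total:
        return {"mode": "full", "nodes": list(healthy)}
    if count == total - 1:
        # One element lost: run with reduced computing demand.
        return {"mode": "degraded", "nodes": list(healthy)}
    return None
```

Because every node-to-node path on the single crossbar is identical, the sketch does not need to care which element was lost, only how many remain.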

5.4 Remote Software Upgrade 

Downtime due to the updating of software is counted against the availability of 
a computer system, and therefore a remote reload of software is a necessity. The 
MCW-1 is capable of downloading new software during normal operation. The 
reboot strategy means that the downtime due to starting up new software is only a 
few seconds. 

Referenced Documents 

1. "MC/OS High Availability Functional Design Specification", Yevgeniy 
Tarashchanskiy, 17 April 2000. 

3 MERCURY PART NUMBER 

The board identifier name is MCW-1a and the Mercury part number is 560549. 

4 FUNCTIONAL DESCRIPTION 

4.1 OVERVIEW 

The MCW-1a is designed to be an algorithm processing daughter card utilizing the MPC7400 PPC, MPC8240, 
PCE133 ASIC, XBAR++ ASIC, and PXB++ FPGA. The MCW-1a mates with a Motorola base station modem board. 
The MCW-1a can provide additional connectivity between processing elements in different sector slots utilizing over-the-top 
RACEway++ cables. It is a Motorola form factor card with four computational nodes and one host node. The 
computational nodes (CNs) are based on the latest MPC7400 PPC microprocessor and the host is an MPC8240. The 
MCW-1a can provide one Ethernet 10/100 BT port on the front panel. A 32-bit, 66 MHz PCI interface provides the 
interface to the Motorola board. 

The MCW-1a block diagram is shown in Figure 1. 

Figure 1. MCW-1a BLOCK DIAGRAM 

Figure 2 shows the MCW-1a system topology. Table 1 gives the proposed route codes for the board. 




Figure 2. MCW-1a BOARD-LEVEL TOPOLOGY 

Table 1. Route Codes for MCW-1a Board XBAR 

Route Code    Destination for Virtual Ports    Physical XBAR 1 Ports 
0 to 15       (entries not recoverable)        (entries not recoverable) 

4.2 FEATURES 

• Custom size daughter card. 

• Master PCI 32-bit @ 66 MHz compliant with Rev 2.2 PCI local bus spec. 
    PCI write peak performance is 240 MB/sec. 
    PCI read peak performance is 220 MB/sec. 

• Single IEEE 802.3 compliant Ethernet 10BASE-T/100BASE-T. 

• Four computation nodes (CNs) based on MPC7400 PPC running @ 400 MHz. 
    1 MB L2 cache per CN @ 200 MHz to 266 MHz. 
    128 MB SDRAM with ECC per CN @ 133 MHz. 
    Hardware based watchdog monitor. 
    One PCE133 ASIC per CN. 

• Two over-the-top, 66 MHz RACEway++ interlink ports 
  configured in cable mode. 

• PCI interface 32-bit @ 66 MHz. 
    RACEway++ crossbar to connect nodes. 
    PXB++ 64-bit @ 33 MHz PCI bus. 

• Non-transparent 64-bit/33 MHz to 32-bit/66 MHz PCI bridge. 

• 200 MHz PPC8240 PowerPC processor. 
    32-bit 33 MHz PCI bus. 
    100 MHz, 64 Mbytes SDRAM. 
    Bulk FLASH interface. 
    Linear address mode. 
    32 banks of 1 Mbyte. 
    LEDs. 
    8 Kbytes non-volatile SRAM. 
    Real time clock. 
    Compute node fault isolation control. 
    JTAG test port. 

4.3 CONFIGURATION OPTIONS 

4.3.1 CPU Options 

• MPC7400 @ 400 MHz. 

• MPC7410 @ 400 MHz. 

4.3.2 SDRAM Options 

• 128 MB SDRAM @ 133 MHz with ECC. 

4.3.3 FLASH Memory Options 

• 16 MB FLASH memory. 

• 32 MB FLASH memory. 

4.3.4 Ethernet Options 

• No Ethernet. 

• Single Ethernet. 

4.4 REQUIREMENTS 

4.4.1 Mechanical Form Factor 

The MCW-1a form factor conforms to TBD Motorola mechanical requirements. 

4.4.2 Power Requirements 

The MCW-1a requires +5.0 volts from the modem board. The +1.5V to +2.1V core voltage required by the 
MPC7400 is converted from +5.0V on the board. Two core supplies are used to power the four CPU cores. 
The 2.5V voltage required is converted from +5.0V by an onboard power supply. The 3.3V voltage required is also 
converted from +5.0V by an onboard power supply. The MCW-1a estimated typical power dissipation is 50 watts @ 
5.0V. 

4.4.3 Electrical Interface 

The MCW-1a provides a PCI 32-bit, 66 MHz interface to the Motorola modem board via an 80-pin connector. 

The MCW-1a provides two over-the-top RACEway++ ports via two connectors located on the front panel. 

The MCW-1a provides the single Ethernet 10/100 BT interface available from one RJ-45 connector. The Ethernet 
interface is provided by a third party Ethernet-to-PCI interface controller chip that is bridged to the crossbar 
RACEway++ port by means of a PXB++ FPGA (See Figure 2). 

4.4.4 Functional 

1. Shall have the main SDRAM memory at 133 MHz or greater. 

2. Shall have a 1 Mbyte L2 cache at 200 MHz or greater. 

3. All CE nodes shall have 128 Mbyte of SDRAM. 

4. Host node shall have at least 32 Mbytes of nonvolatile memory. 

Form factor requirements: 

5. Shall be a daughter card that is Va of a Motorola proprietary form factor modem payload card sized 11" by 14", on 
20mm centers board to board. {actual shape, dimensions etc. TBD via drawings from Motorola.} 

6. Shall be electrically a PMC module, TBD from further discussions with customer. 

7. Shall use P1, P2 for 32/66 MHz PCI bus. 

8. Shall have a maximum heat dissipation of 50W. 

System requirements: 

9. A minimum of 105 Mbyte/sec from the modem payload module to the MCW-1a card shall be provided through the 
PCI interface. 

10. From the MCW-1a card to Motorola Modem Payload module output bandwidth shall be at least 200 kbyte/sec, 
concurrent with the 105 Mbyte/sec input. 

11. The system shall have a bandwidth of at least 250 Mbyte/sec between CEs, e.g. RACE++ at 66 MHz, as a minimum. 

12. Shall have non-volatile memory, for at least 32 Mbytes of data. 

13. Shall support software upgrade from remote locations. 



4.5 COMPATIBILITY 

The MCW-1a board is a custom daughter card designed for the Motorola base station modem board. 

4.6 PERFORMANCE 

The PCI bus standard and the PXB++ FPGA limit RACEway++ transfers to the PCI performance. Peak transfers of 240 
MB/sec are achievable between the PXB++, PPC8240 and the non-transparent PCI Bridge. (See Figure 1) 

Data transfers of up to 266 MB/sec peak are supported for access from RACEway++ to/from the MPC7400 CE's local 
SDRAM memory. 

PCE133 ASIC-initiated DMA transfers run at optimum RACEway++ speeds approaching 266 MB/sec peak. A single 
DMA command can transfer data between the CN's local SDRAM memory and RACEway++ in either direction. 
The DMA engine formats transfers across RACEway++ optimally using packets up to 2048 bytes. 

The operating clock frequency of the PCE133 ASIC, SDRAM, and MPC7400 processor bus is 133 MHz. Likewise, the 
operating frequency for the RACEway++ is 66 MHz. The local PCI clock is used by the corresponding PXB++ FPGA 
and does not exceed 33 MHz. 

A separate 25 MHz oscillator is included on the MCW-1a for driving the Ethernet interface. 

4.7 DETAILED DESCRIPTION 

The MCW-1a block diagram is shown in Figure 1. 

4.7.1 Modem Board Interface 

TBD (PCI 32-bit 66 MHz). 
TBD PCI to PCI bridge stuff 
TBD Motorola requirements. 


4.7.2 Board Resets 

There are several sources of reset to the daughter card. A MAX823 voltage supervisor will generate a 200 ms 
reset after VCC rises above 4.38 volts. When the MAX823 reset is deasserted, state machine logic will 
monitor PCI_RESET_0. The state machine will continue driving RESET_0 until both the MAX823 and 
PCI_RESET_0 are deasserted. Either reset will generate the signal RESET_0, which will reset the card into its 
power-on state. RESET_0 will also generate the HRESET_0 and TRST signals to the five CPUs. HRESET_0 
and TRST for each of the CPUs can also be generated by their JTAG ports: JTAG_HRESET_0 and 
JTAG_TRST respectively. The MPC8240 is capable of generating a reset request, a soft reset (C_SRESET_0) 
to each CPU, a checkstop request, and a CE ASIC reset (CE_RESET_0) to each of the four CE ASICs. A 
discrete from the 5V powered reset PLD will generate the signal NPORESET_1 (not a power on reset). This 
signal is fed into the MPC8240's discrete input word. The MPC8240 will read this signal as a logic low only if 
it is coming out of reset due to either a power condition or an external reset from offboard. Each node, as well 
as the MPC8240, may request a board level reset. These requests are majority voted, and the result 
RESETVOTE_0 will generate a board level reset. 

Figure 3 shows the MCW-1a hard reset generation function. 

Figure 3. HARD RESET FUNCTIONAL BLOCK DIAGRAM 



MCW-1a Functional Specification, created on 2/2/01 

4.7.3 Watchdog Monitor 

There are five independent watchdog monitors on the MCW-1a card. Each processor node is responsible for 
strobing its watchdog once every 20 msec (initial window after board level reset is 2 sec) but no sooner than 
500 usec. Strobing the watchdog for the processing nodes is accomplished by writing a zero/one sequence to 
the DIAG3 discrete coming from the PCE133 ASIC. The MPC8240's watchdog is serviced by writing to 
the memory mapped discrete location FFFF_D027. A single write of any value will strobe the watchdog. Upon 
power-on, the watchdogs come up in a failed state; once a valid strobe is issued, the watchdog will be satisfied. 
If the CPU fails to service the watchdog within the valid window, the watchdog will fail. A watchdog of a 
failing processing node will trigger an interrupt to the MPC8240. An MPC8240 watchdog fault will trigger a 
reset to the board. The watchdog will then remain in a latched failed state until a CPU reset occurs followed by 
a valid service sequence. Figure 4 shows valid service sequences of the watchdog. 

[Timing diagram: Reset and Software service traces. The first service must occur within 2 seconds of reset; each 
later service must fall between 500 usec and 20 msec after the previous one.] 

Watchdog fault conditions: 

• A service within 500 usec of last service. 
• No service within 20 msec of last service. 

Figure 4. EXAMPLE WATCHDOG SERVICE SEQUENCES 
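
The service window can be expressed as a small checker. This is a sketch under assumptions: times are integer microseconds since reset, the window edges are treated as inclusive (the text does not say), and the function only judges the strobes it is given, not what happens after the last one.

```python
INITIAL_WINDOW_US = 2_000_000   # first strobe must arrive within 2 sec of reset
MIN_GAP_US = 500                # a strobe sooner than this is a fault
MAX_GAP_US = 20_000             # a gap longer than this is a fault


def service_sequence_valid(strobes_us):
    """Check a list of strobe times (ascending, microseconds since reset)."""
    if not strobes_us or strobes_us[0] > INITIAL_WINDOW_US:
        return False
    for previous, current in zip(strobes_us, strobes_us[1:]):
        gap = current - previous
        if gap < MIN_GAP_US or gap > MAX_GAP_US:
            return False
    return True
```

A sequence that strobes too early fails just as one that strobes too late does, which mirrors the two fault conditions listed above.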



4.7.4 Operating Frequency 

The MPC7400 bus runs at 133 MHz. The L2 cache bus of the MPC7400 runs at 200 MHz to 266 MHz. The SDRAMs 
run at 133 MHz. The RACEway++ interface runs at 66 MHz. The local PCI bus runs at 33 MHz and the offboard PCI 
runs at 66 MHz. The MPC8240's internal frequency is 200 MHz while its SDRAM interface is 100 MHz. 

4.7.4.1 Clock Margining 

This card has two crystal oscillators for the three clock domains present on the card. A 66 MHz oscillator clocks the 
RACEway++ interface and MPC7400 CNs. The 66 MHz frequency is divided in half to generate a 33 MHz signal for 
the PCI interface. A second oscillator, 25 MHz, clocks the Ethernet and watchdog circuitry. Both the PCI and MPC 
clocks are marginable. In order to provide clock margining, a 4-pin connector allows the test engineer to functionally 
disable the onboard oscillator and replace it with a test frequency. The pinout of this connector is detailed in Table 2. 

Table 2. Test Clock Connector 

Pin    Signal 
1      GND 
2      /Test Clock 
3      Test Clock 
4      Test Clock Enable L 



4.7.5 Serial Configuration EEPROM 

There are several serial EEPROMs used to load configuration to the CE ASICs, PXB++ and XBAR++ after reset. The 
serial EEPROM functionality can be found in the ASIC's functional specification. 

4.7.5.1 CE ASIC Serial EEPROM 

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during 
manufacture of the MCW-1a to contain configuration information for the CE ASIC. The serial EEPROM AT24C128 is 
controlled from the CE ASIC. After reset, the CE ASIC automatically reads the first location from the serial EEPROM. 
Refer to the CE ASIC functional specification, reference 3, for information on reading and writing this device. 

4.7.5.2 PXB++ FPGA Serial EEPROM 

The serial EEPROM can be read and programmed by means of the PCI bus or the RACEway++ bus. It is programmed 
during manufacture of the MCW-1a to contain configuration information for the PXB++. The serial EEPROM AT24C128 
device is 128K bits and is controlled from the PXB++. After reset, the PXB++ automatically reads 8 KB from the serial 
EEPROM and initializes the PXB++ internal registers. Refer to the PXB++ FPGA functional specification, reference 5, 
for information on reading and writing this device. 

4.7.5.3 XBAR++ ASIC Serial EEPROM 

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during 
manufacture of the MCW-1a to contain configuration information for the XBAR++. The serial EEPROM AT24C128 is 
controlled from the XBAR++ ASIC. After reset, the XBAR++ ASIC automatically reads from the serial EEPROM and 
initializes the XBAR++ internal registers. Refer to the XBAR++ ASIC functional specification, reference 4, for 
information on reading and writing this device. 

4.7.5.3.1 Register Description 

Reference 4 describes the registers of the XBAR++ ASIC. 

4.7.6 RACEway++ Interconnect 

Communication between all processing and I/O elements on the system card is provided by a Mercury eight-port 
crossbar XBAR++ ASIC. The XBAR++ provides up to three simultaneous 266 MB/sec peak throughput data paths 
between elements for a total peak throughput of 798 MB/sec. Three crossbar ports connect to the RapidIO Bridge 
FPGA. Each MPC7400 CN uses one crossbar port. The Ethernet and MPC8240 interface to a crossbar port through the 
PXB++ (See Figure 2). Reference 4 describes the operation and registers of the XBAR++ ASIC. 
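
The throughput and port figures in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative only; the numbers are the ones stated above.

```python
PATH_PEAK_MB_PER_SEC = 266      # peak throughput of one crossbar data path
SIMULTANEOUS_PATHS = 3          # paths that can be active at the same time

total_peak = PATH_PEAK_MB_PER_SEC * SIMULTANEOUS_PATHS

# Port usage as described in the text for the eight-port crossbar.
ports_used = {
    "bridge FPGA": 3,           # three ports connect to the bridge FPGA
    "MPC7400 CNs": 4,           # one port per compute node, four nodes
    "PXB++": 1,                 # Ethernet and MPC8240 share this port
}
ports_total = sum(ports_used.values())
```

The product matches the quoted 798 MB/sec total, and the port tally accounts for all eight crossbar ports.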

4.7.7 Local PCI I/O Bus 

The PXB++ FPGA provides the local PCI I/O bus. This bus is accessible by means of the RACEway++ from the 
processing nodes. All resources on this bus are initialized and controlled by the MPC8240. This bus provides access to 
an Ethernet controller, PCI to PCI transparent bridge and the PPC8240 host controller. Transfers from devices on this 
local PCI bus to and from devices on the RACEway++ can achieve 240 MB/sec for writes and 220 MB/sec for reads. 
These rates assume block transfers of reasonable size. 

4.7.7.1 PXB++ Program EEPROM 

The PXB++ FPGA is programmed by an XC18V04 configuration EEPROM running in parallel mode. Configuration 
initiates when a power-on or board level reset occurs. Dividing the onboard 33 MHz generates the configuration clock of 
16.6 MHz. The configuration EEPROM itself is onboard programmable through the JTAG scan chain. 



4.7.7.1.1 Register Description 

Reference 5 describes the registers of the PXB++ FPGA. 



4.7.8 Ethernet Interface 

The PCI-to-Ethernet interface uses the AM79C973 PCnet-FAST III single chip 10/100 Mbps Ethernet controller. This 
device is equipped with a built in physical layer interface to achieve a minimal parts count Ethernet interface. A 25 MHz 
oscillator provides the proper clock frequency to the Ethernet chip. The PCI interrupt from the Ethernet chip is wired to 
the MPC8240's external interrupt controller. 

4.7.9 MPC7400 or Nitro Computer Nodes (CNs) 

The board contains four MPC7400 CNs. Each MPC CN uses a PCE133 ASIC to interface the CPU to RACEway++. The 
PCE133 ASIC provides all the standard features of a CN, such as a DMA engine, mail box interrupts, timers, 
RACEway++ page mapping registers, SDRAM interface, and so on. Local memory for each CN consists of 32, 64, or 
128 MB SDRAM, and L2 cache SRAM. Each CN also has a nonvolatile SRAM and watchdog monitor. The CPU bus is 
64-bit data, 32-bit address, and operates synchronously at 133 MHz. 


4.7.9.1 Processor 

The MCW-1a card is designed to use either the 400 MHz MPC7400 or the 400 MHz Nitro processors. The processor is 
packaged in a 25mm, 360-ball CBGA package. Each processor requires the attachment of a heat sink to keep it within 
its thermal limits. 

4.7.9.2 MPC7400 L2 Cache 

The MPC7400 L2 cache for each CN is composed of pipelined, single-cycle deselect, sync burst SRAM. This is 
implemented using two 64K, 128K, or 256K by 36-bit sync burst SRAM parts to make a 0.5 MB, 1 MB, or 2 MB L2 
cache. MPC7400 L2 cache can be depopulated to 0 MB. 
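
The three cache sizes follow from the part depths once the data width is fixed. The sketch below assumes that 32 of each part's 36 bits carry data and that the remaining 4 bits are parity that does not count toward capacity; that split is an assumption, not something the text states.

```python
def l2_cache_mb(depth_k, parts=2, data_bits_per_part=32):
    """Cache capacity in MB for `parts` SRAMs of depth_k K words each."""
    words = depth_k * 1024
    total_bytes = parts * words * data_bits_per_part // 8
    return total_bytes / (1024 * 1024)


sizes = {depth: l2_cache_mb(depth) for depth in (64, 128, 256)}
```

Under that assumption, two 64K parts give 0.5 MB, two 128K parts give 1 MB, and two 256K parts give 2 MB, in line with the options listed.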

4.7.9.3 PCE133 ASIC 

The MPC processor compute element ASIC (PCE133 ASIC) is a Mercury-designed component. It provides the 
interface between the MPC7400, the synchronous DRAM, and the RACEway++. All the PCE133 features such as 
DMA, mailbox interrupts, timers, address snooping, prefetch buffers, and so on, are available in this configuration. This 
chip is provided in a 35mm, 388-ball BGA package. Reference 3 describes the operation and registers of the PCE133 
ASIC. 



4.7.9.3.1 Register Description 

Reference 3 describes the registers of the PCE133 ASIC. 

4.7.9.4 Address Map 

4.7.9.4.1 Master Address Map 

Transfers from the MPC7400 to the PCE133 ASIC and RACEway++ are address mapped as shown in Table 3. 
The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked read/write and locked read 
transactions are supported for all data sizes. The 16 Mbyte boot FLASH area is further divided in Table 4. 

Table 3. Master Address Map 

From Address    To Address     Function 
0x0000 0000     0x0FFF FFFF    Local SDRAM 256 MB 
0x1000 0000     0x1FFF FFFF    XBAR 256 MB map window 1 
0x2000 0000     0x2FFF FFFF    XBAR 256 MB map window 2 
0x3000 0000     0x3FFF FFFF    XBAR 256 MB map window 3 
0x4000 0000     0x4FFF FFFF    XBAR 256 MB map window 4 
0x5000 0000     0x5FFF FFFF    XBAR 256 MB map window 5 
0x6000 0000     0x6FFF FFFF    XBAR 256 MB map window 6 
0x7000 0000     0x7FFF FFFF    XBAR 256 MB map window 7 
0x8000 0000     0x8FFF FFFF    XBAR 256 MB map window 8 
0x9000 0000     0x9FFF FFFF    XBAR 256 MB map window 9 
0xA000 0000     0xAFFF FFFF    XBAR 256 MB map window A 
0xB000 0000     0xBFFF FFFF    XBAR 256 MB map window B 
0xC000 0000     0xCFFF FFFF    XBAR 256 MB map window C 
0xD000 0000     0xDFFF FFFF    XBAR 256 MB map window D 
0xE000 0000     0xEFFF FFFF    XBAR 256 MB map window E 
0xF000 0000     0xFBFF FBFF    Not used (CE reg replicated mapping) 
0xFBFF FC00     0xFBFF FDFF    Internal CN ASIC registers 
0xFBFF FE00     0xFEFF FFFF    Prefetch control 
0xFF00 0000     0xFFFF FFFF    16 MB boot FLASH memory area 

Table 4. Boot FLASH Address Map 

From Address    To Address     Function 
0xFF00 2006     0xFF00 2006    Software Fail Register 
0xFF00 2005     0xFF00 2005    MPC8240 HA Register 
0xFF00 2004     0xFF00 2004    Node 3 HA Register 
0xFF00 2003     0xFF00 2003    Node 2 HA Register 
0xFF00 2002     0xFF00 2002    Node 1 HA Register 
0xFF00 2001     0xFF00 2001    Node 0 HA Register 
0xFF00 2000     0xFF00 2000    Local HA Register (status/control) 
0xFF00 0000     0xFF00 1FFF    NovRAM 

4.7.9.4.2 Slave Address Map 

Slave accesses are defined as accesses initiated by an external RACEway++ device directed toward the MPC7400 CN. 
The MPC is not accessible as a slave device. The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked 
read/write and locked read are supported for all data sizes. The PCE RACEway port supports a 256 MB address space 
partitioned as follows in Table 5: 

Table 5. Slave Address Map 

From Address    To Address     Function 
0x0000 0000     0x0FFF FBFF    256 MB less 1 KB hole SDRAM 
0x0FFF FC00     0x0FFF FFFF    PCE133 internal registers 

4.7.9.5 Interrupt 

Reference 3 describes the internal interrupt sources for the PCE133 ASIC. The external interrupt pin on the PCE133 
ASIC is driven by the HA PLD and is currently not used. The interrupt output from the PCE133 ASIC is wired to the 
CPU's external interrupt input pin. 

4.7.9.6 PCE133 DIAG Bits 

The DIAG3 signal is wired to the HA PLD and is used to strobe the node's hardware watchdog monitor. The DIAG2 
signal is wired to the MPC8240's interrupt controller and is used by the node to generate a general purpose interrupt to 
the MPC8240. The DIAGBIT signal is wired to the HA PLD and is currently not used. 

4.7.9.7 MPC7400 Reset 

The MPC7400 hard reset signal is driven by three sources gated together: the HRESET_0 pin on the PCE133 ASIC, 
HRESET_0 from the JTAG connector, and HRESET_0 from the majority voter. The HRESET_0 pin from the CE ASIC 
is set by the "node run" bit field (bit 0) of the PCE133 ASIC's Miscon_A register. Setting HRESET_0 low causes the 
MPC7400 to be held in reset. HRESET_0 is low immediately after system reset or power-up; the MPC7400 is held in 
reset until the HRESET_0 line is pulled high by setting the node run bit to 1. The JTAG HRESET_0 is controlled by 
debugger software when a JTAG debugger module is connected to the card. The HRESET_0 from the majority voter is 
generated by a majority vote from all healthy nodes to reset. 
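
The majority vote can be pictured with a short sketch. The exact voting rule of the hardware is not given here, so the code assumes a strict majority of the participating (healthy) voters; the voter names and function name are hypothetical.

```python
def majority_requests_reset(votes):
    """Return True when more than half of the voters request a reset.

    votes: mapping of voter name to True (requests reset) or False.
    Only healthy voters are expected to appear in the mapping.
    """
    yes = sum(1 for wants_reset in votes.values() if wants_reset)
    return 2 * yes > len(votes)
```

With five voters, three requests carry the vote and two do not. If an unhealthy node is left out of the mapping, the threshold is computed over the remaining voters.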

4.7.9.8 Boot Procedures 

When a CPU reset is asserted, the MPC7400 is put into reset state. The MPC7400 will remain in a reset state until the 
RUN bit 0 of the Miscon_A register is set to 1 and the MPC8240 has released the reset signals in the discrete output 
word. The RUN bit should be set to 1 after the boot code has been loaded into the SDRAM starting at location 
0x0000_0100. The ASIC maps the reset vector 0xFFF0_0100 generated by the MPC7400 to address 0x0000_0100. 
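
The two release conditions and the vector remapping can be modeled in a few lines. This is a simplified sketch: the function names are hypothetical, and only the single reset-vector mapping stated in the text is modeled, since the ASIC's general remapping rule is not described here.

```python
RESET_VECTOR = 0xFFF0_0100      # address the MPC7400 fetches from after reset
BOOT_CODE_START = 0x0000_0100   # SDRAM location where boot code is loaded


def cpu_released(run_bit, host_reset_released):
    """The CPU leaves reset only when both conditions hold."""
    return run_bit == 1 and host_reset_released


def map_fetch_address(addr):
    """Redirect the reset vector fetch into SDRAM; leave other addresses alone."""
    return BOOT_CODE_START if addr == RESET_VECTOR else addr
```

Setting the RUN bit before the boot code is in SDRAM would let the CPU fetch from 0x0000_0100 with nothing valid there, which is why the text orders the load before the bit is set.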

4.7.9.9 MPC7400 CN SDRAM 

The main memory for each CN is composed of one bank of synchronous DRAM. This is implemented using five 
K4S280832A-TC/L75 @ 133 MHz synchronous DRAM parts. As shown in the memory map (See Table 3), the main 
memory begins at address 0x0 and grows upward in the address space as memory is increased. The PCE133 ASIC 
supports error correction (ECC) on the SDRAM. 

The SDRAM operates as zero wait state memory and can provide up to 1 GB/sec peak bandwidth on writes from 
MPC7400 and 800 MB/sec peak bandwidth on read from the MPC7400. ECC error correction is supported. 

4.7.9.10 MPC7400 Non-Volatile RAM 

Each node will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration 
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the PCE133 ASIC's 
boot FLASH interface. The data bus of the device is isolated from the PCE ASIC through an IDT IDTQS32244SO 
buffer. This buffer provides loading isolation and 3.3V to 5V translation. 

4.7.10 MPC8240 Host Controller 

The MPC8240 integrated processor is comprised of a peripheral logic block and a 32-bit embedded MPC603e PowerPC 
processor core. The peripheral logic integrates a PCI bridge, memory controller, DMA controller, EPIC interrupt 
controller, a message unit, and an I2C controller. The processor core is a full-featured, high-performance processor with 
floating-point support, memory management, 16 Kbytes instruction cache, 16 Kbytes data cache, and power management 
features. 

Major features of the MPC8240 are as follows: 

Peripheral logic 

- Memory interface 
    High-bandwidth bus, 64-bit data bus, to SDRAM. 
    ECC protected SDRAM. 
    16 Mbytes of ROM space (32 Mbytes paged). 
    8-bit ROM. 
    Write buffering for PCI and processor accesses. 

- PCI interface 
    32-bit PCI interface operating at 33 MHz (66 MHz capable). 
    PCI 2.1-compatible. 
    Support for accesses to all PCI address spaces. 
    Selectable big- or little-endian operation. 
    Store gathering of processor-to-PCI write and PCI-to-memory write accesses. 
    PCI bus arbitration unit (five request/grant pairs). 

- Two-channel integrated DMA controller 
    Supports direct mode or chaining mode (automatic linking of DMA transfers). 
    Supports scatter gathering read or write discontinuous memory. 
    Interrupt on completed segment, chain, and error. 
    Local-to-local memory. 
    PCI-to-PCI memory. 
    PCI-to-local memory. 
    Local-to-PCI memory. 

- Message unit 
    Two doorbell registers. 
    Inbound and outbound messaging registers. 
    I2O message controller. 

- I2C controller with full master/slave support 

- Embedded programmable interrupt controller (EPIC) 
    Five hardware interrupts (IRQs) or 16 serial interrupts. 
    Four programmable timers. 

- Integrated PCI bus and SDRAM clock generation 

- Programmable memory and PCI bus output drivers 

- Debug features 
    Memory attribute and PCI attribute signals. 
    Debug address signals. 
    MIV signal: marks valid address and data bus cycles on the memory bus. 
    Error injection/capture on data path. 
    IEEE 1149.1 (JTAG)/test interface. 

Processor core 

- High-performance, superscalar processor core 
    Integer unit (IU). 
    Floating-point unit (FPU) (user enabled or disabled). 
    Load/store unit (LSU). 
    System register unit (SRU). 
    Branch processing unit (BPU). 

- 16-Kbyte instruction cache 

- 16-Kbyte data cache 

- Lockable L1 cache - entire cache or on a per-way basis 

- Dynamic power management 



4.7.10.1 Address Map 

The MPC8240 in PCI host mode supports two address mapping configurations designated as address map A and 
address map B. Address map A conforms to the PowerPC reference platform (PReP) specification. Address map B 
conforms to the PowerPC microprocessor common hardware reference platform (CHRP). Note that the support of map 
A is provided for backward compatibility only. It is strongly recommended that new designs use map B because map A 
may not be supported in future devices. 

The address space of map B is divided into four areas: system memory, PCI memory, PCI I/O, and system ROM space. 
When configured for map B, the MPC8240 translates addresses across the internal peripheral logic bus and the external 
PCI bus as shown in Table 6. 



Table 6. MPC8240 Address Map B

Processor Core Address Range                    PCI Address Range        Definition
Hex                     Decimal
Start       End         Start      End
----------  ----------  ---------  ----------   -----------------------  ---------------------------
0000_0000   0009_FFFF   0          640K-1       NO PCI CYCLE             System memory
000A_0000   000F_FFFF   640K       1M-1         000A_0000 - 000F_FFFF    Compatibility hole
0010_0000   3FFF_FFFF   1M         1G-1         NO PCI CYCLE             System memory
4000_0000   7FFF_FFFF   1G         2G-1         NO PCI CYCLE             Reserved
8000_0000   FCFF_FFFF   2G         4G-48M-1     8000_0000 - FCFF_FFFF    PCI memory
FD00_0000   FDFF_FFFF   4G-48M     4G-32M-1     0000_0000 - 00FF_FFFF    PCI/ISA memory
FE00_0000   FE7F_FFFF   4G-32M     4G-24M-1     0000_0000 - 007F_FFFF    PCI/ISA I/O
FE80_0000   FEBF_FFFF   4G-24M     4G-20M-1     0080_0000 - 00BF_FFFF    PCI I/O
FEC0_0000   FEDF_FFFF   4G-20M     4G-18M-1     CONFIG_ADDR              PCI configuration address
FEE0_0000   FEEF_FFFF   4G-18M     4G-17M-1     CONFIG_DATA              PCI configuration data
FEF0_0000   FEFF_FFFF   4G-17M     4G-16M-1     FEF0_0000 - FEFF_FFFF    PCI interrupt acknowledge
FF00_0000   FF7F_FFFF   4G-16M     4G-8M-1      FF00_0000 - FF7F_FFFF    32/64-bit FLASH/ROM (1)
FF80_0000   FFFF_FFFF   4G-8M      4G-1         FF80_0000 - FFFF_FFFF    8/32/64-bit FLASH/ROM (2)



Notes: 

(1) This bank of FLASH is not used. 

(2) This bank of FLASH is configured in 8-bit mode and is further broken down in Table 7. 
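
The translation described by Table 6 can be illustrated with a short lookup. This is only a sketch under stated assumptions: the names MAP_B and translate are hypothetical, the window list is transcribed from Table 6, and a window with no single PCI base (no PCI cycle, or a configuration register) is represented as None.

```python
# Address map B windows transcribed from Table 6:
# (cpu_start, cpu_end, pci_start, definition).
# pci_start is None where the table lists "NO PCI CYCLE" or a
# configuration register instead of a PCI address range (assumption
# made for this sketch only).
MAP_B = [
    (0x0000_0000, 0x0009_FFFF, None,        "System memory"),
    (0x000A_0000, 0x000F_FFFF, 0x000A_0000, "Compatibility hole"),
    (0x0010_0000, 0x3FFF_FFFF, None,        "System memory"),
    (0x4000_0000, 0x7FFF_FFFF, None,        "Reserved"),
    (0x8000_0000, 0xFCFF_FFFF, 0x8000_0000, "PCI memory"),
    (0xFD00_0000, 0xFDFF_FFFF, 0x0000_0000, "PCI/ISA memory"),
    (0xFE00_0000, 0xFE7F_FFFF, 0x0000_0000, "PCI/ISA I/O"),
    (0xFE80_0000, 0xFEBF_FFFF, 0x0080_0000, "PCI I/O"),
    (0xFEC0_0000, 0xFEDF_FFFF, None,        "PCI configuration address"),
    (0xFEE0_0000, 0xFEEF_FFFF, None,        "PCI configuration data"),
    (0xFEF0_0000, 0xFEFF_FFFF, 0xFEF0_0000, "PCI interrupt acknowledge"),
    (0xFF00_0000, 0xFF7F_FFFF, 0xFF00_0000, "32/64-bit FLASH/ROM"),
    (0xFF80_0000, 0xFFFF_FFFF, 0xFF80_0000, "8/32/64-bit FLASH/ROM"),
]


def translate(cpu_addr):
    """Return (definition, pci_address or None) for a processor core address."""
    for cpu_start, cpu_end, pci_start, definition in MAP_B:
        if cpu_start <= cpu_addr <= cpu_end:
            if pci_start is None:
                return definition, None
            # Keep the offset within the window, rebased onto the PCI range.
            return definition, pci_start + (cpu_addr - cpu_start)
    raise ValueError("address outside the 32-bit map")
```

For example, a processor access to FD00_1234 falls in the PCI/ISA memory window and would appear on the PCI bus at 0000_1234 under this reading of the table.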



Table 7. Port X Address Map

Bank Select     Processor Core Address Range   Definition
                Start         End
-------------   -----------   -----------      ------------------------------------------------
11111           FFE0_0000     FFEF_FFFF        Accesses Bank 0
11110 - 00001   FFE0_0000     FFEF_FFFF        Application code (1) (30 pages)
00000           [illegible]   FFEF_FFFF        Application/boot code (1) [remainder illegible]
XXXX (3)        [illegible]   [illegible]      Application/boot code [remainder illegible]
XXXX (3)        FFFF_D000     FFFF_D000        Discrete input word 0
XXXX (3)        FFFF_D001     FFFF_D001        Discrete input word 1
XXXX (3)        FFFF_D002     FFFF_D002        Discrete output word 0
XXXX (3)        FFFF_D003     FFFF_D003        Discrete output word 1
XXXX (3)        FFFF_D004     FFFF_D004        Discrete output word 2
XXXX (3)        FFFF_D010     FFFF_D010        IC (Pending interrupt)
XXXX (3)        FFFF_D011     FFFF_D011        IC (Interrupt mask low)
XXXX (3)        FFFF_D012     FFFF_D012        IC (Interrupt clear low)
XXXX (3)        FFFF_D013     FFFF_D013        IC (Unmasked, pending low)
XXXX (3)        FFFF_D014     FFFF_D014        IC (Interrupt input low)
XXXX (3)        FFFF_D015     FFFF_D015        Unused (read FF)
XXXX (3)        FFFF_D016     FFFF_D016        Unused (read FF)
XXXX (3)        FFFF_D017     FFFF_D017        Unused (read FF)
XXXX (3)        FFFF_D018     FFFF_D018        Unused (read FF)
XXXX (3)        FFFF_D019     FFFF_D019        Unused (read FF)
XXXX (3)        FFFF_D020     FFFF_D020        HA (Local HA register)
XXXX (3)        FFFF_D021     FFFF_D021        HA (Node 0 HA register)
XXXX (3)        FFFF_D022     FFFF_D022        HA (Node 1 HA register)
XXXX (3)        FFFF_D023     FFFF_D023        HA (Node 2 HA register)
XXXX (3)        FFFF_D024     FFFF_D024        HA (Node 3 HA register)
XXXX (3)        FFFF_D025     FFFF_D025        HA (8240 HA register)
XXXX (3)        FFFF_D026     FFFF_D026        HA (Software Fail)
XXXX (3)        FFFF_D027     FFFF_D027        HA (Watchdog Strobe)
XXXX (3)        FFFF_D028     FFFF_DFFF        4068 Bytes FLASH
XXXX (3)        FFFF_E000     FFFF_FFFF        8K NOVRAM



Notes: 

(1) Thirty-one 1Mbyte blocks of application memory residing at address FFE0_0000 - FFEF_FFFF, selected by the
FLASH page bits.

(2) 2Mbyte block available after reset.

(3) Always available.



4.7.10.2 Register Description 

Reference 10 describes the registers of the MPC8240. 



4.7.10.3 Interrupt

The MPC8240 contains an embedded programmable interrupt controller (EPIC) device. The EPIC implements the
necessary functions to provide a flexible and general-purpose interrupt controller solution. The EPIC pools
hardware-generated interrupts from many sources, both within the MPC8240 and externally, and delivers them to the
processor core in a prioritized manner. The solution adopts the OpenPIC architecture (architecture developed jointly
by AMD and Cyrix for SMP interrupt solutions) and implements the logic and programming structures according to
that specification. The MPC8240's EPIC unit supports up to five external interrupts, four internal logic-driven
interrupts and four timers with interrupts. See Reference 10 for a detailed description of the EPIC unit.

The five external interrupt inputs to the EPIC are wired to the external interrupt controller PLD. 

4.7.10.4 MPC8240 Reset 

The MPC8240 can be reset from three sources: a board level reset (RESET_0), a JTAG controlled reset, or a failure in
its watchdog monitor. Any reset to the MPC8240 shall cause the discrete output registers to go to the reset (low)
state. This, in turn, will cause all G4 nodes to enter the reset state.

4.7.10.5 Boot Procedure

After the release of reset to the MPC8240, it will begin executing code out of the FLASH memory. A reset will
automatically set the FLASHSEL(4:0) bits to all zeros; therefore, the MPC8240's boot code must reside in bank 0.
Once its application code is copied to SDRAM, the MPC8240 can then sequence through the FLASH banks by setting
the appropriate bits in the discrete output word. Application code for the G4 nodes resides in the remaining
thirty-one banks of FLASH.




4.7.11 Bulk FLASH Memory

There are 32Mbytes of bulk FLASH memory, comprising two Intel 28F128J3 StrataFLASH memory devices. The
MPC8240's memory map limits the size of the 8-bit wide FLASH to 2Mbytes, which requires hardware to divide the
FLASH into thirty-two 1Mbyte banks. Five software-controlled discretes allow switching between banks. Accesses to
the 1Mbyte address range of FFE0_0000 through FFEF_FFFF will always access the first block of FLASH,
NOVRAM, discrete I/O, HA registers, watchdog monitor, and the interrupt controller. Accesses to the 1Mbyte address
range of FFF0_0000 through FFFF_FFFF will access a page of memory in the FLASH. The actual page selected is
based on the five FLASH select bits, driven by the Discrete Output word.
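
As an illustration of how software might update the five FLASH select bits without disturbing the other discretes in the same output word, consider the sketch below. It assumes that FLASHSEL4 through FLASHSEL0 sit at DH(3:7) of Discrete Output Word 1 (as listed in Table 10) and that DH(0) is the most significant bit of the byte, so the select field is the low five bits; the function names are hypothetical, and how a select value maps to a physical bank is not assumed here.

```python
# ASSUMPTION: DH(0) is the most significant bit, so FLASHSEL4..FLASHSEL0
# at DH(3:7) occupy the five least significant bits of the byte.
FLASHSEL_MASK = 0x1F


def with_flash_select(output_word1, select):
    """Return the output word with the FLASHSEL field replaced by `select`."""
    if not 0 <= select <= 31:
        raise ValueError("FLASH select value must fit in five bits")
    # Preserve WRAP0, I2C_RESET_0 and SWLED (upper three bits).
    return (output_word1 & ~FLASHSEL_MASK & 0xFF) | select


def flash_select(output_word1):
    """Extract the current FLASHSEL(4:0) value from the output word."""
    return output_word1 & FLASHSEL_MASK
```

A read-modify-write of this kind keeps the I2C reset and LED discretes unchanged while the select value is stepped from 0 to 31.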



4.7.12 Real Time Clock 

The PCF8563 is a CMOS real-time clock/calendar optimized for low power consumption. A programmable clock
output, interrupt output and voltage-low detector are also provided. All addresses and data are transferred serially
via a two-line bidirectional I2C-bus. Maximum bus speed is 400 kbits/s.

Real Time Clock Features: 

- Provides year, month, day, weekday, hours, minutes and seconds 

(Based on an external 32.768 kHz quartz crystal) 

- Century flag 

- Wide operating supply voltage range: 1.0 to 5.5 V

- Low back-up current; typical 0.25 µA at VDD = 3.0 V and Tamb = 25 °C

- 400 kHz two-wire I2C-bus interface (at VDD = 1.8 to 5.5 V)

- Programmable clock output for peripheral devices: 32.768 kHz, 1024 Hz, 32 Hz and 1 Hz 

- Alarm and timer functions 

- Voltage-low detector 

- Integrated oscillator capacitor 

- Internal power-on reset 

- I2C-bus slave address: read A3H; write A2H

- Open drain interrupt pin 
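
The two slave address bytes listed above are the same 7-bit device address with the read/write bit in the least significant position, which is how the I2C bus encodes the first byte of a transfer. A minimal sketch (the function name is hypothetical):

```python
def split_i2c_address_byte(address_byte):
    """Split an 8-bit I2C address byte into (7-bit address, direction)."""
    # Bit 0 of the first byte on the bus is the R/W flag: 1 = read, 0 = write.
    direction = "read" if address_byte & 0x01 else "write"
    return address_byte >> 1, direction
```

Both A3H and A2H reduce to the 7-bit address 51H, which is the value a driver that takes 7-bit addresses would normally be given.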

4.7.13 Nonvolatile Memory 

The MPC8240 will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the local bus
interface. The device's data bus is isolated from the local bus through an IDT IDTQS32244SO buffer. This buffer
provides 3.3V to 5V translation.

4.7.14 Fault Status and Control Registers 

The MPC8240 has access to five 8-bit status registers. One register represents its own status while the others
represent the fault status of the four G4 CPUs. Each register has the identical format, as shown in Table 8.
These five registers grant the MPC8240 status information from each node on the board, without going through the
Raceway fabric.

The MPC8240 will have one 8-bit Fault control register. The control register has the format shown in Table 9.



Bit   Name                   Description
---   --------------------   ------------------------------------------------------------------------------
0     CHECKSTOP_OUT          Checkstop state of CPU (0 = CPU in checkstop)
1     WDM_FAULT              WDM failed (0 = WDM failed, set high after reset and valid service)
2     SOFTWARE_FAULT         Software fault detected (set to 0 when a software exception was detected) (R/W local)
3     RESETREQ_IN            Wrap status of the local CPU's reset request
4     WDM_INIT               WDM failed in initial 2 second window (0 = WDM failed)
5     Software definable 0   Software definable 0
6     Software definable 1   Software definable 1
7     unused                 unused

Table 8. Fault Status Register Format
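
To show how the active-low flags of Table 8 might be interpreted by software, the following sketch decodes one status byte. It rests on an assumption that the source does not state: bit 0 is taken to be the most significant bit of the byte (PowerPC-style numbering). The names used are hypothetical.

```python
# Bit positions and names from Table 8.
# ASSUMPTION: bit 0 is the most significant bit of the 8-bit register.
FAULT_STATUS_BITS = {
    0: "CHECKSTOP_OUT",
    1: "WDM_FAULT",
    2: "SOFTWARE_FAULT",
    3: "RESETREQ_IN",
    4: "WDM_INIT",
}

# Flags that Table 8 describes as indicating a fault when they read 0.
ACTIVE_LOW_FAULTS = ("CHECKSTOP_OUT", "WDM_FAULT", "SOFTWARE_FAULT", "WDM_INIT")


def decode_fault_status(value):
    """Return the logic level (0 or 1) of each named bit."""
    return {name: (value >> (7 - bit)) & 1
            for bit, name in FAULT_STATUS_BITS.items()}


def active_faults(value):
    """List the fault flags that are asserted (read as 0) in the register."""
    levels = decode_fault_status(value)
    return [name for name in ACTIVE_LOW_FAULTS if levels[name] == 0]
```

If the hardware instead numbers bit 0 as the least significant bit, only the shift expression changes.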



Bit   Name                   Description
---   --------------------   ------------------------------------------------------------------------
0     RESETREQ_OUT_0         Request a reset event (0 => forces reset)
1     CHKSTOPOUT_0           Request that node 0 enter checkstop state (0 => request checkstop)
2     CHKSTOPOUT_1           Request that node 1 enter checkstop state (0 => request checkstop)
3     CHKSTOPOUT_2           Request that node 2 enter checkstop state (0 => request checkstop)
4     CHKSTOPOUT_3           Request that node 3 enter checkstop state (0 => request checkstop)
5     CHKSTOPOUT_8240        Request that the MPC8240 enter checkstop state (0 => request checkstop)
6     Software definable 0   Software definable 0
7     Software definable 1   Software definable 1

Table 9. Fault Control Register Definition



4.7.15 Majority Voter 

There are two different functions controlled by majority voters. The first is local to each CPU; this voter controls
the assertion of CHECKSTOP_IN to the CPU. The second voter is centralized to the board; it will control the master
reset to the board. Both voters shall follow the same set of rules: the output will follow the majority of
non-checkstopped CPUs. A 1-on-1 or 2-on-2 condition in either voter will result in a board level reset.
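
The voting rules can be sketched in a few lines. This is an illustrative model only, not the PLD logic: the function name and return strings are hypothetical, and the behavior when no CPU is eligible to vote is not specified in the text, so the sketch simply returns "deassert" in that case.

```python
def vote(requests, checkstopped):
    """Apply the majority-voter rules to one decision.

    requests:     list of bool, True if that CPU votes to assert the output.
    checkstopped: list of bool, True if that CPU is in checkstop and
                  therefore excluded from the vote.
    """
    active = [r for r, stopped in zip(requests, checkstopped) if not stopped]
    yes = sum(active)
    no = len(active) - yes
    # A tie among the voting CPUs (1-on-1 or 2-on-2) forces a board reset.
    if active and yes == no:
        return "board_reset"
    # ASSUMPTION: with no eligible voters the output stays deasserted.
    return "assert" if yes > no else "deassert"
```

For instance, with five voters and one of them checkstopped, a two-against-two split among the remaining four produces a board level reset rather than following either side.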

4.7.16 Discrete I/O

There are 16 discrete output signals directly controllable and readable by the MPC8240. The 16 discretes are divided
up into two addressable 8-bit words. Writing to a discrete output register will cause the upper 8 bits of the data bus
to be written to the discrete output latch. Reading a discrete output register will drive the 8-bit discrete output onto
the upper 8 bits of the MPC8240's data bus. Table 10 defines the bits in the discrete output word.

There are 16 discrete input signals accessible by the MPC8240. Reads from the discrete input address space will
latch the state of the signals, and return the latched state of the discretes to the MPC8240. Table 11 defines the bits
in the discrete input word.



Table 10. Discrete Output Words

Word 2
DH(0:7)   Signal            Description
-------   ---------------   ------------------------------------------
0         ND0_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
1         ND1_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
2         ND2_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
3         ND3_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
4         WRAP1             Wrap to discrete input
5
6
7

Word 1
DH(0:7)   Signal            Description
-------   ---------------   ------------------------------------------
0         WRAP0             Wrap to discrete input
1         I2C_RESET_0       Reset the I2C serial bus when 0
2         SWLED             Software controlled LED
3         FLASHSEL4         Flash bank select address bit 4
4         FLASHSEL3         Flash bank select address bit 3
5         FLASHSEL2         Flash bank select address bit 2
6         FLASHSEL1         Flash bank select address bit 1
7         FLASHSEL0         Flash bank select address bit 0

Word 0
DH(0:7)   Signal            Description
-------   ---------------   ------------------------------------------
0         C_SRESET3_0       Issue a Soft Reset to CPU on Node 3 when 0
1         C_PRESET3_0       Reset PCE133 ASIC Node 3 when 0
2         C_SRESET2_0       Issue a Soft Reset to CPU on Node 2 when 0
3         C_PRESET2_0       Reset PCE133 ASIC Node 2 when 0
4         C_SRESET1_0       Issue a Soft Reset to CPU on Node 1 when 0
5         C_PRESET1_0       Reset PCE133 ASIC Node 1 when 0
6         C_SRESET0_0       Issue a Soft Reset to CPU on Node 0 when 0
7         C_PRESET0_0       Reset PCE133 ASIC Node 0 when 0



Table 11. Discrete Input Words

Word 1
DH(0:7)   Signal             Description
-------   ----------------   -----------------------------------------------
0         WRAP1              Wrap from discrete output word
1         TBD
2         V3.3_FAIL_0        Latched status of power supply since last reset
3         V2.5_FAIL_0        Latched status of power supply since last reset
4         VCORE1_FAIL_0      Latched status of power supply since last reset
5         VCORE0_FAIL_0      Latched status of power supply since last reset
6         RIOR_CNF_DONE_1    RIO/RACE++ FPGA configuration complete
7         PXB0_CNF_DONE_1    PXB++ FPGA configuration complete

Word 0
DH(0:7)   Signal             Description
-------   ----------------   -----------------------------------------------
0         WRAP0              Wrap from discrete output word
1         WDMSTATUS          MPC8240's watchdog monitor status (0 = failed)
2, 3      NPORESET_1         Not a power on reset when high
4
5
6
7






4.7.17 Interrupt Controller

The MPC8240 interfaces with an 8-input interrupt controller external to the MPC8240 itself. The interrupt inputs are
wired, through the controller, to interrupt zero of the MPC8240 external interrupt inputs. The remaining four
MPC8240 interrupt inputs are unused.

The Interrupt Controller comprises the following five 8-bit registers:



Pending Register - A low bit indicates a falling edge was detected on that interrupt (read only)

Clear Register - Setting a bit low will clear the corresponding latched interrupt (write only)

Mask Register - Setting a bit low will mask the pending interrupt from generating an MPC8240 interrupt

Unmasked Pending Register - A low bit indicates a pending interrupt that is not masked out

Interrupt State Register - Indicates the actual logic level of each interrupt input pin




4.7.17.1 Interrupt Controller Operation 

Table 12 lists the interrupt input sources and their bit positions within each of the five registers. A falling edge on
an interrupt input will set the appropriate bit in the pending register low. The pending register is gated with the mask
register and any unmasked pending interrupts will activate the interrupt output signal to the MPC8240's external
interrupt input pin. Software will then read the unmasked pending register to determine which interrupt(s) caused the
exception. Software can then clear the interrupt(s) by writing a zero to the corresponding bit in the clear register. If
multiple interrupts are pending, the software has the option of either servicing all pending interrupts at once and then
clearing the pending register, or servicing the highest priority interrupt (software priority scheme) and then clearing
that single interrupt. If more interrupts are still latched, the interrupt controller will generate a second interrupt to
the MPC8240 for software to service. This will continue until all interrupts have been serviced. An interrupt that is
masked will show up in the pending register but not in the unmasked pending register and will not generate an
MPC8240 interrupt. If the mask is then cleared, that pending interrupt will flow through the unmasked pending
register and generate an MPC8240 interrupt.
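
The register behavior described above can be modeled in software to make the active-low conventions concrete. The class below is a behavioral sketch, not the PLD implementation: the class and method names are hypothetical, and bit positions are counted from the least significant bit purely for convenience.

```python
class InterruptControllerModel:
    """Behavioral model of the active-low pending/mask/clear registers."""

    def __init__(self):
        self.pending = 0xFF  # all bits high: no interrupt latched
        self.mask = 0xFF     # all bits high: no interrupt masked

    def falling_edge(self, bit):
        """Latch an interrupt: a falling edge sets the pending bit low."""
        self.pending &= ~(1 << bit) & 0xFF

    def write_mask(self, value):
        """A low bit in the mask register masks that interrupt."""
        self.mask = value & 0xFF

    def write_clear(self, value):
        """A low bit written to the clear register releases the latch."""
        self.pending |= ~value & 0xFF

    @property
    def unmasked_pending(self):
        # A bit reads low only if it is pending (0) and not masked (mask = 1).
        return (self.pending | ~self.mask) & 0xFF

    @property
    def interrupt_asserted(self):
        """True while any unmasked interrupt is still latched."""
        return self.unmasked_pending != 0xFF
```

In this model a masked interrupt stays visible in the pending register, and raising its mask bit again lets it reach the unmasked pending register and assert the interrupt output, as the text describes.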



Table 12. Interrupt Controller Inputs

Bit   Signal          Description
---   -------------   ------------------------------------------
0     SWFAIL_0        8240 Software Controlled Fail Discrete
1     RTC_INT_0       Real time clock event
2     NODE0_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
3     NODE1_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
4     NODE2_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
5     NODE3_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
6     PCI_INT_0       PCI interrupt
7     XB_SYS_ERR_0    XBAR internal error

4.7.18 Configuration Jumpers

J18-1 - J18-2, the watchdog monitor mask, when installed, will mask all watchdog failures.
J18-3 - J18-4, the serial EEPROM's write enable jumper, when installed, enables modification of the serial EEPROMs.
J18-5 - J18-6, the flash write-protect jumper, when installed, prevents modification of any flash memory location.
J18-7 - J18-8, the PXB0 use PROM jumper, when installed, will enable the PXB0's serial configuration PROM.

4.7.19 LEDs

There are nine LEDs, visible at the top of the board.

LD1 is a software controlled LED
LD2 is a software controlled LED
LD3 is the Node 0 watchdog fail LED
LD4 is the Node 1 watchdog fail LED
LD5 is the Node 2 watchdog fail LED
LD6 is the Node 3 watchdog fail LED
LD7 is the MPC8240 watchdog fail LED
LD8 indicates the state of the board level reset
LD9 indicates a XBAR system error.

There are an additional two LEDs on the Ethernet connector for Ethernet status.



4.7.20 Power Supply

The MCW-1a board requires 3.3V, 2.5V, and 1.8V. There are two 1.8V supplies; each drives the core voltage for two
CPUs. To provide power to the MCW-1a, the three voltages must have separate switching supplies, and proper power
sequencing to the device must be provided. All three voltages are converted from 5.0V. The power to the daughter card
is provided directly from the modem board.

4.7.20.1 MPC7400 Core Power Supply 

There are two core voltage power supplies; each one is dedicated to two MPC7400 PPC cores. The core voltage can be
in the 2.2V to 1.5V range. This power supply is rated at 12A in the range from 2.2V to 1.5V.

4.7.20.2 Main 3.3V Power Supply 

A 3.3V power supply is used to provide power to the SBSRAM core, SDRAM, SCSI, PXB++, XBAR++, and PCE133
I/O. This power supply is rated at TBD Amp.

4.7.20.3 Core and I/O 2.5V Power Supply 

A 2.5V power supply is used to provide power to the PCE133 and can also power the PXB++ FPGA core. The
MPC7400 processor bus can run at 2.5V signaling. The MPC7400 L2 bus can operate at 2.5V signaling. This 2.5V
power supply is rated at TBD Amp.

4.7.20.4 ASICs Power Supplies Tolerance Requirements 

SBSRAM VDD = 3.3V +0.165V/-0.165V power supply
SBSRAM VDDQ = 3.3V +0.165V/-0.165V for 3.3V I/O or 2.5V +0.4V/-0.125V for 2.5V I/O
SDRAM VDD = 3.3V +0.3V/-0.3V power supply
XBAR++ VDD = 3.3V +0.3V/-0.3V power supply
PCE133 VDD = 2.5V +?V/-?V power supply
PCE133 VDD33 = 3.3V +?V/-?V power supply

4.7.20.5 Power Supply Voltage Sequencing

Power sequencing is very important in multivoltage digital boards. It is necessary for long-term reliability. Correct
power supply sequencing can be accomplished by using power_good and inhibit signals. To provide fail-safe operation
of the device, power should be supplied so that if the core supply fails during operation, the I/O supply is shut down
as well.

The general rule is to ramp all power supplies up and down at the same time. This is shown in Figure 5. In reality,
ramp up and down depend on multiple factors: power supply, total board capacitance that needs to be charged, power
supply load, and so on. Figure 6 shows ideal worst-case sequencing for ramp up and down that is performed by the
protection sequencing circuits shown in Figure 7. This circuit keeps the voltage difference within the required range.
The MPC7400 requires that the core supply not exceed the I/O supply by more than 0.4 volts at any time. Also, the
I/O supply must not exceed the core supply by more than 2 volts.

Figure 7. VOLTAGE SEQUENCING CIRCUITS

A voltage drop of 0.7V occurs across each diode.
During power up sequencing:

D1 and D2 provide the ramp up voltage for the 2.5V power supply as soon as the 3.3V power supply reaches 1.4V.
D3 and D4 provide the ramp up voltage for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.4V.
D7 and D8 provide the ramp up voltage for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.4V.

During power down sequencing:

D5 provides the ramp down for the 2.5V power supply as soon as the 3.3V power supply reaches 1.8V.
D6 provides the ramp down for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.1V.
D9 provides the ramp down for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.1V.
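
The threshold voltages quoted above are consistent with simple diode-drop arithmetic, and the MPC7400 supply-difference limits stated earlier in this section can be checked the same way. The sketch below is an interpretation rather than a description of the actual circuit: it assumes two series diodes on each ramp-up path and a single diode on each ramp-down path, and all function names are hypothetical.

```python
DIODE_DROP_V = 0.7  # forward drop of one diode, as stated for Figure 7


def ramp_up_threshold(diodes_in_series):
    """Higher-rail voltage at which the series diodes start to conduct
    into a lower rail that is still at 0V."""
    return diodes_in_series * DIODE_DROP_V


def ramp_down_threshold(lower_rail_nominal_v):
    """Higher-rail voltage at which a single diode starts pulling the
    lower rail down (one diode drop below the lower rail's nominal value)."""
    return lower_rail_nominal_v - DIODE_DROP_V


def mpc7400_supplies_ok(v_core, v_io):
    """Check the two limits: core may exceed I/O by at most 0.4V,
    and I/O may exceed core by at most 2V."""
    return (v_core - v_io) <= 0.4 and (v_io - v_core) <= 2.0
```

Under these assumptions two diodes give the 1.4V ramp-up point, while 2.5V and 1.8V rails give ramp-down points of 1.8V and 1.1V respectively, matching the figures in the list.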

The 3.3V power supply is connected to the VCC3P3 power plane. 
The 2.5V power supply is connected to the VCC2P5 power plane. 
The 1.8V_1 power supply is connected to the VCC1P8_1 power plane.
The 1.8V_2 power supply is connected to the VCC1P8_2 power plane.

4.7.20.6 Power Supply Monitoring 

A PLD is used to monitor the voltage status signals from the onboard supplies. It is powered up from +5V and
monitors +3.3V, +2.5V, +1.8V_1 and +1.8V_2. This circuit monitors the power_good signals from each supply. In the
case of a power failure in one or more supplies, the PLD will issue a restart to all supplies and a board level reset to
the daughter card. A latched power status signal will be available from each supply as part of the discrete input word.
The latched discrete shall indicate any power fault condition since the last off-board reset condition.




TBD 



5 ELECTRICAL INTERFACE 

5.1.1 Power Consumption

Table 13. MCW-1a CN Power Consumption

Description   Qty   Total Typ. Power   Total Max. Pwr.
-----------   ---   ----------------   ---------------
CE ASIC       1     1W                 1.5W
SDRAM         5     3W                 3.5W
SBSRAM        2     1.2W               2.5W
G4            1     8W                 12W
Oscillator    1     0.1W               0.1W
PLD           1     0.15W              0.2W
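
Summing the rows of Table 13 gives a per-node total. The snippet below is a plain arithmetic sketch; the dictionary name is hypothetical, and the values are the per-row totals exactly as listed in the table.

```python
# (typical W, maximum W) per row of Table 13; each value is already the
# total for the listed quantity.
CN_POWER_W = {
    "CE ASIC":    (1.0, 1.5),
    "SDRAM":      (3.0, 3.5),
    "SBSRAM":     (1.2, 2.5),
    "G4":         (8.0, 12.0),
    "Oscillator": (0.1, 0.1),
    "PLD":        (0.15, 0.2),
}

total_typ_w = sum(typ for typ, _ in CN_POWER_W.values())
total_max_w = sum(mx for _, mx in CN_POWER_W.values())

print(f"Typical: {total_typ_w:.2f} W, maximum: {total_max_w:.2f} W")
```

With the listed figures this comes to roughly 13.45W typical and 19.8W maximum per compute node.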











Table 14. MCW-1a Power Consumption

TBD

5.1.2 I/O



5.1.2.1 Over-the-Top RACEway++ Interlink

See Appendix A for the over-the-top RACEway++ interlink connector pinout.

5.1.2.2 PCI 32-Bit Modem Connector

See Appendix B for the PCI 32-bit modem connector pinout.

5.1.2.3 Ethernet 10/100BT

See Appendix C for the Ethernet 10/100 BT connector pinout.

5.1.2.4 PPC Debugger

See Appendix D for the PPC Debugger connector pinout.

6 MECHANICAL



6.1.1 Packaging

The MCW-1 is a dual-side PCB assembly. The board is designed to be used in a custom system. The MCW-1 PCB is
TBD thick and TBD layers.

6.1.2 Physical Constraint

The PCB must comply with the Motorola daughter card form factor.

7 ENVIRONMENTAL 

7.1.1 Temperature & Air Flow 

Operating temperature: TBD 
Storage temperature: TBD 



7.1.2 Humidity

TBD



7.1.3 Operating Altitude 
TBD 



7.1.4 Shock & Vibration 

TBD 



7.1.5 Compliance 

TBD 



7.1.6 Reliability 
TBD 



8 SWITCHES & JUMPERS 



8.1 J22 Jumper

Provisional Hotswap switch interface for the PXB0.

J22 Ref. Des.   Jumper Function
-------------   ----------------------
1-2             PXB0_HS_HNDL_SW high
2-3             PXB0_HS_HNDL_SW low

8.2 J11 Jumper

Raceway clock master selection.

J11 Ref. Des.   Jumper Function
-------------   ----------------------
1-2 (open)      MCW-1A Master
1-2 (shorted)   MCW-1A Slave

8.3 J10 Jumper

F1 Raceway XBREQI - XBREQO crossover.

J10 Ref. Des.   Jumper Function
-------------   ----------------------
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.4 J4 Jumper

F2 Raceway XBREQI - XBREQO crossover.

J4 Ref. Des.    Jumper Function
-------------   ----------------------
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.5 J3 Jumper

F2 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J3 Ref. Des.    Jumper Function
-------------   ----------------------
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.6 J9 Jumper

F1 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J9 Ref. Des.    Jumper Function
-------------   ----------------------
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.7 J18 Jumper

Miscellaneous control.

J18 Ref. Des.   Jumper Function
-------------   ---------------------------
1-2             WDM fail disable
3-4             Serial PROM write enable
5-6             FLASH write enable
7-8             PXB0 use configuration PROM
9-10            Unused

8.8 J21 Jumper

Master clock source selector.

J21 Ref. Des.   Jumper Function
-------------   ----------------------
1-2             F1 cable port master
3-4             F2 cable port master
Both closed     MCW-1A master
Both open       MCW-1A master



9 TESTABILITY

9.1 JTAG Test Scan

The MPC7400, MPC8240, PCI-PCI bridge, PCE133 ASIC, PXB++ ASIC, XBAR++ ASIC, and the Ethernet
controller provide support for the IEEE Standard 1149.1 test port (JTAG). Refer to the individual component
specifications to obtain their JTAG test access port (TAP) descriptions.

The MCW-1a board contains several JTAG scan chains. They provide access to the JTAG test port on the
MPC7400s, MPC8240, L2 caches, XBAR++, PCE133s, Ethernet, PCI-PCI bridge, and the PXB devices. The
scan chains are defined as:
Chain 1 -> MPC7400_1
Chain 2 -> MPC7400_2
Chain 3 -> MPC7400_3
Chain 4 -> MPC7400_4
Chain 5 -> MPC8240

Chain 6 -> RESET_PLD, PCEFIX1_PLD, NODE0_HA_PLD, NODE1_HA_PLD, PCEFIX2_PLD,
NODE2_HA_PLD, NODE3_HA_PLD, 8240_DECODE_PLD, VOTER_SYNC_PLD, 8240_HA_PLD,
PXB_PROM, L2 Cache_1, PCE133_1, L2 Cache_2, PCE133_2, XBAR, L2 Cache_3, PCE133_3, L2
Cache_4, PCE133_4, PXB++, PCI-PCI Bridge, Ethernet

The scan path is accessible via connector J16. The enable for the scan chain buffer is controlled by jumper
J20.

The RACEway++ interlink external connectors will be tested with external loop-back connectors.

Note: Both the RACEway++ clock (66 MHz) and the PCI clock (33 MHz) must be running to allow the scan path in
the PXB to function properly.

10 RACEway++ Over-the-Top Connector Pinout

Table 15. RACEway++ F1 Cable Mode Connector Pinout J-1

Pin   Signal       Pin   Signal
---   ----------   ---   ----------------
A1    GND          B1    CLK_X_JX1_IO
A2    GND          B2    JX1_CBL_CLK_IO
A3    GND          B3    JX1_XBREQ_I
A4    GND          B4    JX1_XBREQ_O
A5    GND          B5    JX1_XBSTROBIO
A6    GND          B6    JX1_XBRPLYIO
A7    GND          B7    JX1_XBRDCONIO
A8    GND          B8    JX1_XBIO00
A9    GND          B9    JX1_XBIO01
A10   GND          B10   JX1_XBIO02
A11   GND          B11   JX1_XBIO03
A12   GND          B12   JX1_XBIO04
A13   GND          B13   JX1_XBIO05
A14   GND          B14   JX1_XBIO06
A15   GND          B15   JX1_XBIO07
A16   GND          B16   JX1_XBIO08
A17   GND          B17   JX1_XBIO09
A18   GND          B18   JX1_XBIO10
A19   GND          B19   JX1_XBIO11
A20   GND          B20   JX1_XBIO12
A21   GND          B21   JX1_XBIO13
A22   GND          B22   JX1_XBIO14
A23   GND          B23   JX1_XBIO15
A24   GND          B24   JX1_XBIO16
A25   GND          B25   JX1_XBIO17
A26   GND          B26   JX1_XBIO18
A27   GND          B27   JX1_XBIO19
A28   GND          B28   JX1_XBIO20
A29   GND          B29   JX1_XBIO21
A30   GND          B30   JX1_XBIO22
A31   GND          B31   JX1_XBIO23
A32   GND          B32   JX1_XBIO24
A33   GND          B33   JX1_XBIO25
A34   GND          B34   JX1_XBIO26
A35   GND          B35   JX1_XBIO27
A36   GND          B36   JX1_XBIO28
A37   GND          B37   JX1_XBIO29
A38   GND          B38   JX1_XBIO30
A39   JX1_XBPAR    B39   JX1_XBIO31
A40   +3.3V        B40   R_RST_JX



Table 16. RACEway++ F2 Cable Mode Connector Pinout J-2

Pin   Signal       Pin   Signal
---   ----------   ---   ----------------
A1    GND          B1    CLK_X_JX2_IO
A2    GND          B2    JX2_CBL_CLK_IO
A3    GND          B3    JX2_XBREQ_I
A4    GND          B4    JX2_XBREQ_O
A5    GND          B5    JX2_XBSTROBIO
A6    GND          B6    JX2_XBRPLYIO
A7    GND          B7    JX2_XBRDCONIO
A8    GND          B8    JX2_XBIO00
A9    GND          B9    JX2_XBIO01
A10   GND          B10   JX2_XBIO02
A11   GND          B11   JX2_XBIO03
A12   GND          B12   JX2_XBIO04
A13   GND          B13   JX2_XBIO05
A14   GND          B14   JX2_XBIO06
A15   GND          B15   JX2_XBIO07
A16   GND          B16   JX2_XBIO08
A17   GND          B17   JX2_XBIO09
A18   GND          B18   JX2_XBIO10
A19   GND          B19   JX2_XBIO11
A20   GND          B20   JX2_XBIO12
A21   GND          B21   JX2_XBIO13
A22   GND          B22   JX2_XBIO14
A23   GND          B23   JX2_XBIO15
A24   GND          B24   JX2_XBIO16
A25   GND          B25   JX2_XBIO17
A26   GND          B26   JX2_XBIO18
A27   GND          B27   JX2_XBIO19
A28   GND          B28   JX2_XBIO20
A29   GND          B29   JX2_XBIO21
A30   GND          B30   JX2_XBIO22
A31   GND          B31   JX2_XBIO23
A32   GND          B32   JX2_XBIO24
A33   GND          B33   JX2_XBIO25
A34   GND          B34   JX2_XBIO26
A35   GND          B35   JX2_XBIO27
A36   GND          B36   JX2_XBIO28
A37   GND          B37   JX2_XBIO29
A38   GND          B38   JX2_XBIO30
A39   JX2_XBPAR    B39   JX2_XBIO31
A40   +3.3V        B40   R_RST_JX

11 Modem Board Connector Pinout

Table 17. Modem Board Connector Pin Assignments

J29
Pin   Signal        Signal          Pin
---   -----------   -------------   ---
1     5V            PMC_AD0         2
3     5V            PMC_AD1         4
5     [illegible]   PMC_AD2         6
7     0V            PMC_AD3         8
9     [illegible]   PMC_AD4         10
11                  PMC_AD5         12
13                  PMC_AD6         14
15    [illegible]   PMC_AD7         16
17    0V            PMC_AD8         18
19    0V            PMC_AD9         20
21    PMC_TRDY_0    PMC_AD10        22
23                  PMC_AD11        24
25    GND           PMC_AD12        26
27    PMC_STOP_0    PMC_AD13        28
29    0V            PMC_AD14        30
31    0V            PMC_AD15        32
33    PMC_PERR_0    PMC_AD16        34
35                  PMC_AD17        36
37    GND           PMC_AD18        38
39    PMC_SERR_0    PMC_AD19        40
41    5V            PMC_AD20        42
43    5V            PMC_AD21        44
45    CLK_PMC       PMC_AD22        46
47    GND           PMC_AD23        48
49    GND           PMC_AD24        50
51    PMC_C_BE0     PMC_AD25        52
53    PMC_C_BE1     PMC_AD26        54
55    5V            PMC_AD27        56
57    5V            PMC_AD28        58
59    PMC_C_BE2     PMC_AD29        60
61    PMC_C_BE3     PMC_AD30        62
63    GND           PMC_AD31        64
65    GND           5V              66
67    GND           PMC_FRAME_0     68
69    PMC_INTA_0    GND             70
71    GND           PMC_IRDY_0      72
73    GND           5V              74
75    PMC_GNT_0     PMC_DEVSEL_0    76
77    5V            PMC_LOCK_0      78
79    PMC_REQ_0     PMC_PAR         80



12 Processor JTAG Connector Pinout

The JTAG connectors are unique to each processor. Table 18 shows the generic signal names on each connector pin;
the actual names will have each processor's extension appended to the generic signal name.

Table 18. JTAG Jx Connectors Pin Assignments

Jx-   SIGNAL        Jx-   SIGNAL
---   -----------   ---   -----------
1     TDO           2     QACKN
3     TDI           4     TRSTN
5     HALTEDN       6     3.3V
7     TCK           8     CKSTOP_INN
9     TMS           10    N.C.
11    SRESETN       12    N.C.
13    HRESETN       14    «key»
15    CKSTOP_OUTN   16    GND

13 Non-Processor JTAG Connector Pinout

The non-processor JTAG connector ties together all the remaining JTAG capable devices. Table 19 shows the
signal names on each connector pin. The connector is designed to include only the programmable PLDs and PROM
when the program cable is installed, or the entire chain when the Boundary scan test connector is installed.

Table 19. JTAG J16 Connectors Pin Assignments

J16-   Signal         Description
----   ------------   ------------------------------------
1      TMS_JTAG       JTAG Test Mode Select
2      TDI_JTAG       JTAG Test Data In
3      TDO_JTAG       Boundary Scan Test Data Out
4      TESTN          Driven low when connector inserted
5      TCK_JTAG       JTAG Test Clock
6      GND            Ground on module
7      PXB_CNF_TDO    TDO from end of PLD chain
8      TDI_ND0        TDI into non-PLD Chain
9      +5V            +5V Power on Module
10     TEST           Driven high when connector inserted



[Figure 8 diagram. PLD Program Configuration: J16-1 TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG, J16-7
PXB_CNF_TDO, J16-9 Power, TESTN, GND; devices shown: PLD, PROM. Boundary Scan Test Configuration: J16-1
TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG, J16-7 PXB_CNF_TDO, J16-8 TDI_ND0, J16-3 TDO_JTAG, J16-9 Power,
TESTN, GND; devices shown: PLD, PROM.]



Figure 8. JTAG CONNECTOR CONFIGURATION OPTIONS 
14 Design Notes

14.1 MPC7400 and Nitro Bus Signaling Voltage Support

                 1.8V   2.5V   3.3V
MPC7400 V60x     Yes    Yes    Yes
MPC7400 VL2      Yes    Yes    Yes
Nitro V60x       Yes    Yes    No
Nitro VL2        Yes    Yes    No
PCE133 V60x      No     Yes    No
SBSRAM Vi/o      No     Yes    Yes



14.2 Bypass Capacitor Selection
(Based on Micron application note TN-00-06)

Vcore = 3.3V +/- 0.165V, which is 5%
Vi/o = 2.5V +/- 0.125V, which is 5%

When the SBSRAMs drive a 30pF load from 0V to 2.5V with 1ns edges, the transient current is:
I = (C * dV)/dt = (30pF * 2.5V)/1ns = 75mA per I/O pin.
For 36 I/O pins, 36 * 75mA = 2.7A in a 1ns time interval.

The SyncBurst SRAM has a VDD tolerance of 3.3V +/- 0.165V. Considering some droop from the power bus and a switching
time of 1ns, and allowing a maximum voltage dip (dV) on the SRAM of 0.05V, the choice of bypass capacitor becomes:

C = (I * dt)/dV = (2.7A * 1ns)/0.05V = 54nF per SBSRAM.

Choosing 6 x 10nF allows some margin.

It is better to use reverse-ratio capacitors: 0508, 0406, or 0204.

Low ESR is also very important.

Use a temperature-stable dielectric such as X7R.

From Vishay: VJ0402 style, X7R.
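
The bypass sizing above is two applications of I = C * dV/dt. The following is a minimal sketch of that arithmetic, assuming the 30pF load used in the worked calculation; the function names are illustrative and are not part of the design notes.

```c
#include <assert.h>

/* Transient current (A) drawn when a load of c_farads swings through
 * dv_volts in dt_seconds: I = C * dV / dt. */
double transient_current(double c_farads, double dv_volts, double dt_seconds)
{
    assert(dt_seconds > 0.0);
    return c_farads * dv_volts / dt_seconds;
}

/* Bypass capacitance (F) needed to supply i_amps for dt_seconds while
 * letting the rail dip by at most droop_volts: C = I * dt / dV. */
double bypass_capacitance(double i_amps, double dt_seconds, double droop_volts)
{
    assert(droop_volts > 0.0);
    return i_amps * dt_seconds / droop_volts;
}
```

Calling transient_current(30e-12, 2.5, 1e-9) gives 75mA per pin; 36 pins give 2.7A, and bypass_capacitance(2.7, 1e-9, 0.05) gives 54nF, matching the figures in the note.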

14.3 Tantalum Capacitor Selection

Ultra-low-ESR T510 tantalum capacitors are used in the switching power supply, in addition to several bulk storage capacitors
distributed around the PCB that feed the Vcore and Vi/o planes and enable quick recharging of the bypass chip capacitors.
The number of bulk-storage tantalum capacitors depends on the response time of the power supply.

The MPC7400 can go from nap mode to full-on power within two cycles.

Icore = (10W - 2W)/1.8V ≈ 4.5A
dt = 10µs

C = (I * dt)/dV = (4.5A * 10µs)/0.05V = 900µF
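
The bulk-storage estimate can be checked the same way. This sketch assumes the scanned figures read as a power step from 2W to 10W at 1.8V, a 10µs supply response time and a 50mV allowed dip; those readings, and the function names, are assumptions.

```c
#include <assert.h>

/* Core current step (A) when power rises from p_low_w to p_high_w at v_core. */
double core_current_step(double p_high_w, double p_low_w, double v_core)
{
    assert(v_core > 0.0);
    return (p_high_w - p_low_w) / v_core;
}

/* Bulk capacitance (F) that holds the rail within droop_volts while the
 * supply takes response_s seconds to react to a load step of i_amps. */
double bulk_capacitance(double i_amps, double response_s, double droop_volts)
{
    assert(droop_volts > 0.0);
    return i_amps * response_s / droop_volts;
}
```

The exact step is about 4.44A, which the note rounds up to 4.5A before computing the 900µF total.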
MEMO AF-4
Prototype Framework V0.1

1. Introduction
1.1. Transform Object
1.2. Red-Box
2. Transform Object Sample
2.1. Include the following files to define the interface, and variables required
2.1.1. The contents of dx_dma_var.h
2.2. Initialize the interface
2.3. Receive input
2.3.1. An Example of the receiving of data from input pin 0
2.4. Send Output
2.4.1. An Example of the sending data on output pin 0
3. Transforms for WCDM Simulation
3.1. handset (one of n)
3.1.1. Input pins
3.1.2. Output pins
3.2. Chan (set of one to m objects)
3.2.1. Input pins
3.2.2. Output pins
3.3. broadcast (set of one to k objects)
3.3.1. Input pins
3.3.2. Output pins
3.4. Rake (one of n)
3.4.1. Input pins
3.4.2. Output pins
3.5. MUX (set of one to L objects)
3.5.1. Input pins
3.5.2. Output pins
3.6. MUD (one object for now)
3.6.1. Input pins
3.6.2. Output pins
3.7. BER (set of one to m objects)
3.7.1. Input pins
3.7.2. Output pins
1. Introduction

This is a very brief description of the prototype framework and how to use it. The purpose of this
memo is to describe the software interfaces from within a transform object.

[Figure: software architecture]

The above figure depicts the software architecture. The transform object is a part of the
Application, which is managed by the Application framework.
1.1. Transform Object

The transform object is the basic building block; examples are a Turbo coder, a QAM modulator,
etc.

1.2. Red-Box

The red-box collects transform objects into a logical grouping that describes all of the processing
that will be carried out on a single CPU. (Note: for non-real-time operation, e.g. simulation,
a collection of red-boxes can be placed on a single CPU.)
2. Transform Object Sample

2.1. Include the following files to define the interface, and variables
required.

#include "mc_error.h"
#include "mcwLh"
#include "dx_dma.h"
#include "dx_dma_var.h"

2.1.1. The contents of dx_dma_var.h:

int my_logical_ce;                 /* logical CE number of this transform object */
CONFIG_data *ptr_config_base;      /* base of the shared configuration table */
CONFIG_data *ptr_cur_config;       /* this object's configuration entry */
CONFIG_data *ptr_tmp_config;       /* scratch pointer to a peer's entry */

/* per-input-pin source CE, channel, buffer size and current buffer */
int active_in_ce[(MAX_CE+1) * MAX_CHAN];
int active_in_ch[(MAX_CE+1) * MAX_CHAN];
int active_in_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_in_buf[(MAX_CE+1) * MAX_CHAN];
int active_in_index;

/* per-output-pin destination CE, channel, buffer size and current buffer */
int active_out_ce[(MAX_CE+1) * MAX_CHAN];
int active_out_ch[(MAX_CE+1) * MAX_CHAN];
int active_out_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_out_buf[(MAX_CE+1) * MAX_CHAN];
int active_out_index;

/* send the current output buffer of the given pin */
#define dma_send_pin(pin) \
    dma_send( \
        my_logical_ce, \
        active_out_ce[pin], \
        active_out_ch[pin], \
        (char **)&active_out_buf[pin] \
    )

/* receive the next input buffer of the given pin */
#define dma_rec_pin(pin) \
    dma_rec( \
        active_in_ce[pin], \
        my_logical_ce, \
        active_in_ch[pin], \
        (char **)&active_in_buf[pin] \
    )

2.2. Initialize the interface

// get config SMB
dma_all_init(
    my_logical_ce,
    active_in_ce,
    active_in_ch,
    active_in_buf_size,
    active_in_buf,
    (int *)&active_in_index,
    active_out_ce,
    active_out_ch,
    active_out_buf_size,
    active_out_buf,
    (int *)&active_out_index,
    (CONFIG_data **)&ptr_config_base
);

ptr_cur_config = &ptr_config_base[my_logical_ce];
#ifdef debug_print
printf("Vir CE %i, module name is %s\n",
       my_logical_ce, ptr_cur_config->module_name);
#endif

ptr_cur_config->state = STATE_RDY; /* all init done, now ready */

// wait for rx to be ready
ptr_tmp_config = &ptr_config_base[active_out_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need receiver to be ready
    sched_yield();

// wait for tx to be ready
ptr_tmp_config = &ptr_config_base[active_in_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need sender to be ready
    sched_yield();

#ifdef debug_print
printf("CE %i, Virtual CE %i, Starting\n", (int)ce_getid(), my_logical_ce);
#endif

2.3. Receive input

Receive input data if required; input pins can be left unused.

2.3.1. An Example of the receiving of data from input pin 0

/* get data from other CE */
rc = dma_rec_pin(0);
ERROR_MCWl(rc);

OR

rc = dma_rec(
    active_in_ce[0],
    my_logical_ce,
    active_in_ch[0],
    (char **)&active_in_buf[0]
);
ERROR_MCWl(rc);

The data is available through the active_in_buf pointer. Note that this always points to the next available
input buffer in the case of multi-buffering. At a later date the size of the input chunk and its offset will be
provided so that a FIFO-like structure can be used.

2.4. Send Output

Send output data if required; output pins can be left unused.

2.4.1. An Example of the sending data on output pin 0

/* send data to other CE */
rc = dma_send_pin(0);
ERROR_MCWl(rc);

OR

rc = (long)dma_send(
    my_logical_ce,
    active_out_ce[0],
    active_out_ch[0],
    (char **)&active_out_buf[0]
);
ERROR_MCWl(rc);

The data in the active_out_buf pointer will be sent. On return this always points to the next
available output buffer in the case of multi-buffering. At a later date the size of the output chunk and
its offset will be provided so that a FIFO-like structure can be used.
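
The receive and send steps combine into one "firing" of a transform object. The sketch below illustrates that flow only: stub_dma_rec and stub_dma_send are hypothetical stand-ins for the framework's dma_rec and dma_send (whose real behavior is not shown in this memo), and the buffer sizes and the doubling transform are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PINS 2
#define BUF_LEN  4

/* Hypothetical stand-ins for the framework state in dx_dma_var.h. */
static int   in_storage[NUM_PINS][BUF_LEN];
static int   out_storage[NUM_PINS][BUF_LEN];
static char *active_in_buf[NUM_PINS];
static char *active_out_buf[NUM_PINS];

/* Stub receive: fills the pin's buffer with 1..BUF_LEN and points *buf at
 * it, mimicking a call that returns the next available input buffer. */
static long stub_dma_rec(int pin, char **buf)
{
    for (size_t i = 0; i < BUF_LEN; i++)
        in_storage[pin][i] = (int)i + 1;
    *buf = (char *)in_storage[pin];
    return 0;
}

/* Stub send: pretends to ship *buf, then leaves *buf pointing at the
 * next free output buffer (here, the same single buffer). */
static long stub_dma_send(int pin, char **buf)
{
    *buf = (char *)out_storage[pin];
    return 0;
}

/* One firing: receive on input pin 0, double every sample into the
 * output buffer, send on output pin 0. Returns 0 on success. */
long transform_fire(void)
{
    active_out_buf[0] = (char *)out_storage[0];

    long rc = stub_dma_rec(0, &active_in_buf[0]);
    if (rc != 0)
        return rc;

    const int *in  = (const int *)active_in_buf[0];
    int       *out = (int *)active_out_buf[0];
    for (size_t i = 0; i < BUF_LEN; i++)
        out[i] = 2 * in[i];

    return stub_dma_send(0, &active_out_buf[0]);
}

/* Read back sample i of the last output buffer (helper for checking). */
int last_output(size_t i)
{
    assert(i < BUF_LEN);
    return out_storage[0][i];
}
```

In the real framework the two stubs would be replaced by dma_rec_pin(0) and dma_send_pin(0), with the return code passed to the error macro after each call.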
3. Transforms for WCDM Simulation:
3.1. handset (one of n):

This object has two input pins and one output pin. It performs the following:

1. Generate transport channel
2. MUX and channel coding
3. Generate TX waveform
4. Simulate RX system for power control etc.
5. Outputs to the chan model

3.1.1. Input pins:

3.1.1.1. power_control pin 0:

Input to this pin is from output pin 0 of the rake block and is the slot power control.

3.1.1.2. next_chunk pin 1:

Input to this pin is from output pin 1 of the BER block and is the command to send the next n symbols for
processing, e.g. 2 symbols, or a slot, etc.

3.1.1.3. next_chunk pin 1:

Optional input pin, used to provide external access (i.e. from outside of the Generate traffic channel bits
step) to the raw data input; e.g. if we did a codec, the output of the codec would go into this block.

3.1.2. Output pins:

3.1.2.1. signal_out pin 0:

This pin goes to one input pin of the chan object group.

3.1.2.2. raw_bits pin 1:

This pin has the raw data bits as encoded into the data channel so that the BER and BLER
calculations can be done.

3.2. Chan (set of one to m objects):

In this group of objects, each has two to n input pins and one output pin. They collectively
perform the following:

1. Channel model for each of the inputs except the carry pin
2. Sums the local signals, and adds the carry input pin
3. Outputs to the front_end object to send same data to all rake inputs

3.2.1. Input pins:

3.2.1.1. sum_in pin 0:

Input to this pin is from output pin 0 of another chan object. Currently a dummy input is required
on this pin for the process to fire (needs more thought, i.e. a special first chan??).

3.2.1.2. signal_in pin 1 to n:

Input to this pin is from output pin 0 of the handset block.

3.2.2. Output pins:
3.2.2.1. signal_out pin 0:

This pin goes to input pin 0 of the broadcast object.
3.3. front_end (one object):

This object has one input pin and one output pin. It performs the following:

1. Adds the multiple-antenna and other receiver distortions and noise
2. Simulate RX system (AGC, A/D, multiple antennas) etc.
3. Outputs to the broadcast object to send same data to all rake inputs

Multiple antennas should be treated as separate data streams. The rake receiver will process them
independently, until the MRC stage.

3.3.1. Input pins:

3.3.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the last channel object.

3.3.2. Output pins:

3.3.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the broadcast objects.

3.4. broadcast (set of one to k objects):

This object is required to simulate broadcast; until the simple framework supports this feature, we
need this object.

Each object in the group has one input pin and one to n output pins. They collectively perform
the following:

1. Takes one input and copies it to all of the output pins unmodified
2. Outputs same data to all rake input 0 pins.

3.4.1. Input pins:
3.4.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the front_end object.

3.4.2. Output pins:

3.4.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the rake objects.

3.5. Rake (one of n):

This object has one input pin and two output pins. It performs the following:

1. AFC
2. Initial signal acquisition and searcher receiver
3. Multiple finger receivers
4. Channel estimation, MRC, etc.
5. Final data channel despreading.
6. Outputs to:
   • MUD group of processes
   • Soft-decision symbol processing (FEC decoding and demultiplexing (25.212))

3.5.1. Input pins:
3.5.1.1. signal_in pin 0:

This is the data from the broadcast set, and carries the signals of all the handsets, and noise etc.

3.5.2. Output pins:
3.5.2.1. power_control pin 0:

This is the slot power control to be sent back to the handset.

3.5.2.2. signal_out pin 1:

This pin goes to one input pin of the MUX object group.



3.6. MUX (set of one to L objects):

This object is required to gather and package information from the 1 to n rake objects. The inputs
are placed into packets(???) or into arrays (???) To Be Determined (TBD). This object should be
morphed into the best approximation of the packaging to be provided by a targeted modem.

Each object in the group has one to n input pins and one output pin. They collectively perform
the following:

1. Package rake information into simulated modem sourced data.
2. Outputs to MUD input 0 pin (for now, until MUD integration, there will be a dummy
   placeholder block).

3.6.1. Input pins:

3.6.1.1. signal_in pin 0 to L:

Input to this pin is from output pin 1 of a rake object, or from another MUX object's output pin 0.

3.6.2. Output pins:
3.6.2.1. signal_out pin 0:

This pin goes to input pin 0 of the MUD object, or to an input pin of another MUX object.
3.7. MUD (one object for now):

This object is a placeholder until a real MUD is implemented.
MUD has one input pin and one output pin.

1. Passes through data and formats it for the BER block
2. Outputs to BER input 0 pin.

3.7.1. Input pins:
3.7.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the MUX object.

3.7.2. Output pins:

3.7.2.1. signal_out pin 0:

This pin goes to input pin 0 of the BER object.

3.8. BER (set of one to m objects):

This object is required to gather and package information from the 1 to n handset objects and the
MUD. The inputs are placed into packets(???) or into arrays (???) To Be Determined (TBD).
This object should be morphed into the best approximation of the packaging to be required by a
targeted modem. It also compares the raw input data and raw received data. It also does the FEC
detection and correction and the block error rate.

Each object in the group has one to n input pins and one to n+1 output pins. They collectively
perform the following:

1. Package rake/MUD information into simulated modem destination data.
2. Perform all of the bit level processing, interleaving, FEC. This should be in a separate block.
3. BER, BLER etc. BLER should be done via the CRC check, after all symbol decoding is
   performed.
4. Outputs to GUI input 0 pin to display the stats.
5. Outputs the generate-next-slot command to the one to n handsets.

3.8.1. Input pins:
3.8.1.1. signal_in pin 0 to m:

Input to this pin is from output pin 0 {for now until MUD integrated} of the MUD object, or
another output pin 0 of a BER object.

3.8.2. Output pins:

3.8.2.1. stats_out pin 0:

This pin goes to input pin 0 of the host object for display of data on the GUI.

3.8.2.2. next_slot pin 1 (one of n):

This pin goes to input pin 1 of the handset object to indicate the system is ready for the next slot
of data.
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From: Jon Greene <greene@mc.com> 

To: "Lauginiger, Frank" <fpl@mc.com>, <joates@mc.com>, <afuchs@mc.com>, 

<nnvinskus@ mc.com> 
Date: 6/23/00 3:05PM 

Subject: Some MUD analysis 

All: 

Obviously, I've been thinking about MUD a lot. Below is some analysis.

First, some news. We apparently have 400 MHz, 2 meg / 266 MHz L2 Nitros in
house (samples). Vitaly is presently working to bring them up. This is
excellent news. Besides the above speed/size properties, Nitros use
significantly lower power than Max's and allow for varying L2 configuration
options. Nitro L2's can be configured the normal way (as a cache) or all or
half (1 meg) as SRAM memory and can be addressed as such directly. For
example, one can write a buffer into this memory with vmov or, better yet,
as the output of some computation. I'm not sure if it could be the source or
target of a RACEway xfer but we should try to find this out. Even if
configured as a coherent cache, it can be easily locked and unlocked in user
mode. I think configuring as 2 meg of SRAM may work the best for MUD but we
should determine this empirically.

Now, a critical analysis of ops, buffer sizes, bandwidth, access patterns,
algorithm structure and phases of the moon, are all essential to arriving at
a strategy that stands a chance of working. This of course is not easy
because various techniques impact all of the above in unequal ways. Let's
just consider the R1/R1m R-matrix processing on the above Nitro with a
maximum of 100 users. *Without* taking advantage of the diagonal symmetry in
the Corr matrix, which I now believe will be very difficult to do in the
R-matrix ucoded processing loop(s) (we should discuss this), but still
assuming Corr *can* effectively exist as a byte matrix without degrading
accuracy beyond acceptability, a single plane (i.e., a processor's worth) of
the Corr matrix requires 200 * 200 * 32 = 1,280,000 bytes which fits, albeit
uncomfortably, into the L2. At 2 gigabytes/sec (~ 266 * 8), this matrix (if
L2 resident) can theoretically be consumed in 0.64 ms (remember, 1.33 ms is
our budget). Now, *if* we go with a completely separate X matrix calculation
without stripmining *and* we also store it as byte values, it would require
at most 100 * 100 * 32 = 320,000 bytes. This must be entirely produced and
consumed in the 1.33 ms time slice. In *theory*, this can be done in 0.32
ms. Finally, the R1_temp output is of size 200 * 200 = 40,000 bytes and can
be produced in 0.02 ms. So, with the fully separate X matrix approach and no
symmetry in the Corr, we theoretically require ~1,750,000 bytes of buffer
size (I added a little more for stray stuff such as the C vectors and the
phys <=> virt LUTs, etc.) and ~1.0 ms to produce and consume these buffers.
If we stripmined X, which seems a better way to go, we could hopefully keep
it resident in L1, thereby reducing L2 buffers to ~1,350,000 bytes and 0.7
ms of L2 I/O. The stripmining also allows us the option of keeping the X
strip as shorts rather than bytes.
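
The buffer and bandwidth budget in the paragraph above can be re-derived with a few lines of C. This is only a sketch of the arithmetic: the function names are illustrative, and the 2 gigabytes/sec L2 rate is the rounded figure quoted in the message.

```c
#include <assert.h>

/* Bytes in an n x n matrix holding `depth` one-byte entries per element. */
long matrix_bytes(long n, long depth)
{
    return n * n * depth;
}

/* Time in ms to stream `bytes` once at `bytes_per_sec`. */
double stream_ms(long bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * (double)bytes / bytes_per_sec;
}

/* L2 traffic time in ms for one slot with a fully separate X matrix:
 * Corr is read once, X is written and then read, R1_temp is written once. */
double slot_l2_ms(long corr_bytes, long x_bytes, long out_bytes,
                  double bytes_per_sec)
{
    return stream_ms(corr_bytes, bytes_per_sec)
         + 2.0 * stream_ms(x_bytes, bytes_per_sec)
         + stream_ms(out_bytes, bytes_per_sec);
}
```

With matrix_bytes(200, 32), matrix_bytes(100, 32) and matrix_bytes(200, 1) as inputs, slot_l2_ms returns 0.98 ms, which the message rounds to ~1.0 ms against the 1.33 ms budget.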

Now let's consider the ops count. For the R1/R1m processing (including the
generation of the X matrix and 2 antennas), I come up with (2 * 6 * 100 *
100 * 16 + 4 * 200 * 200 * 16) * 750 = (1,920,000 + 2,560,000) * 750 = 3.36
GOPS. (BTW, if you were wondering, 750 = 1000/1.33.) The R0 processing has
less GOPS due to the symmetry. I get (1,920,000 + 2,560,000/2) * 750 = 2.40
GOPS. Since the R0 and R1/R1m processing use the same X matrix, we may be
tempted to consider having only the R0 processor compute the X matrix and
ship it to the R1/R1m processor. This looks nice from a GOPS perspective (R0
= 2.40, R1/R1m = 1.92) but I'm not sure it will work very well given the
lockstep nature of the processing pipe. For example, will the R1/R1m
processor simply be idle waiting for the X matrix or will it be completing
the *prior* R1_temp processing while the R0 processor is computing the
current X?

But the real killer about having R0 ship X to R1/R1m is that the X matrix
(320,000 bytes) will take at least 1.23 ms over RACE++
(320,000/260,000,000). And let's not forget the 40,000 byte R_temp output
matrix that has to also be shipped out in the same time frame. So I don't
think this ops balancing approach will work.
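
The operation counts and the RACE++ shipping time quoted above follow from simple products. The sketch below reproduces them; the factors (2 * 6 and 16 for X, 4 and 16 for the dot products, 750 slots per second, 260 megabytes/sec) are taken verbatim from the message, and the function names are illustrative.

```c
#include <assert.h>

#define SLOTS_PER_SEC 750.0   /* 1000 / 1.33 ms, as quoted in the message */

/* Ops per slot to generate X for `users` physical users. */
double x_ops(double users)
{
    return 2.0 * 6.0 * users * users * 16.0;
}

/* Ops per slot for the dot-product stage over `vusers` virtual users;
 * the count is halved when the R0 symmetry is exploited. */
double r_ops(double vusers, int use_symmetry)
{
    double ops = 4.0 * vusers * vusers * 16.0;
    return use_symmetry ? ops / 2.0 : ops;
}

/* Sustained rate in GOPS for a given per-slot operation count. */
double gops(double ops_per_slot)
{
    return ops_per_slot * SLOTS_PER_SEC / 1e9;
}

/* Time in ms to ship `bytes` over a link sustaining `bytes_per_sec`. */
double ship_ms(double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * bytes / bytes_per_sec;
}
```

gops(x_ops(100) + r_ops(200, 0)) gives 3.36 for R1/R1m, gops(x_ops(100) + r_ops(200, 1)) gives 2.40 for R0, and ship_ms(320000, 260e6) gives about 1.23 ms, nearly the whole 1.33 ms slot.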

We therefore appear to require 3.36 GOPS out of R1/R1m and we might just not
even bother with the R0 symmetry since it doesn't buy you very much given
that mpic needs both R0 and R1/R1m as inputs. In other words, have both
R-matrix processors run essentially the same code. (Will this work?)

Now 3.36 GOPS out of one processor is a tall order. We may have to resort to
a more asymmetric division of labor (the R0 processor takes advantage of the
R0 symmetry and also does a portion of R1/R1m). But I'd like to pursue the
more balanced division until we are absolutely sure it won't work.

In this approach, both the R0 and R1/R1m processors independently produce
and consume X in strips. A variant could instead produce and consume a
single "value" (actually 32 shorts) of X in a single ucode primitive that
does both the complex multiplies and the dot products (the MUDder of all
primitives). The former is certainly the easier approach and might get us
all the way there but the latter, if it can be cleverly coded, may perform
better. In all cases, the ops don't change but at least the L2 gets some
breathing room.

In any event, the so-called dot-product loop, whether it's separate or
includes the complex multiply, still remains a difficult piece of code to
fully optimize if we allow the number of virtual to physical users to vary
as MUD (and Dr. Oates) demands. Using a LUT to acquire the index list and
count of virtual users for a given physical user will tend to throttle the
dot product code due to short vector lengths, funny address calculations,
and "random" load and store patterns. The load isn't so bad since it's two
cache lines no matter where it comes from. We may want to reorder Corr
anyway just to ease the address arithmetic and DST logic. We could also
simply store in the order we produce and leave it to the mpic processor to
reorder (poor guy). As for the short vector count, I think this can be
overcome with a clever primitive that "pauses" as little as possible between
index lists but this will take some careful design.

I think we should try for the "balanced" stripmine approach with essentially
the same two primitives running in each processor. In the absence of
dissenting views, I will continue modifying the C code to realize this
structure. I'm still not sure where the Amp/fac_xx multiply(s)/shift(s)
belong but for now I'll rid them entirely from the R-matrix functions that
I'm preparing for ucoding.

-Jon 
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CC: 



"Kenny , Jamie" <jfk@mc.com> 
Mercury Computer Systems, Inc.
199 Riverneck Road
Chelmsford, MA 01824-2820
(978) 256-1300 • Fax (978) 256-3599
http://www.mc.com

To: Wireless Communications Group
From: J. H. Oates
Subject: Channel Estimation
Date: October 20, 2000

1. Introduction

In the conventional RAKE receiver, channel amplitude [1] estimation is required for maximal
ratio combining of the RAKE fingers. The BER performance is not strongly dependent on the
accuracy of the channel amplitude estimates. For Multi-User Detection (MUD) the channel
amplitude estimates are used for signal subtraction, and accuracy of the channel
amplitude estimates is more critical. In addition, the channel estimation error is larger
when MUD is used since channel estimation is performed in a higher interference
environment. This report investigates the accuracy of the conventional channel amplitude
estimation techniques under elevated multiple access interference. The effect of channel
amplitude estimation error on MUD efficiency is then assessed. The analysis presented
here is intended to be a first look. There are a number of ways to increase the channel
amplitude estimation accuracy. A few of these are discussed below.

Section 2 presents a model for the received signal and matched-filter outputs. The effect of
channel estimation error on MUD efficiency is addressed in section 3. In section 4 the
accuracy of the conventional channel amplitude estimates is assessed. In section 5
improved single-user methods are presented for channel amplitude estimation. Section 6
presents a multi-user channel amplitude estimation method. Section 7 addresses the
effect of uncancelled multipath on the MUD efficiency, which is used in section 8 to
assess the effect of dropping small amplitudes. It is shown that the overall MUD efficiency
is improved by dropping small amplitudes. Conclusions are drawn in section 9.

2. Signal Model and Matched-Filter Outputs

The baseband received signal can be written 



[1] Amplitudes are complex and hence include magnitude and phase.
$$ r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT]\, b_k[m] + w[t] \tag{1} $$

where $t$ is the integer time sample index, $T = N N_c$ is the data bit duration, $N = 256$ is the
short-code length, $N_c$ is the number of samples per chip, $w[t]$ is receiver noise, and where
$\tilde{s}_k[t]$ is the channel-corrupted signature waveform for virtual user $k$. For $L$ multipath
components the channel-corrupted signature waveform for virtual user $k$ is modeled as

$$ \tilde{s}_k[t] = \sum_{p=1}^{L} a_{kp}\, s_k[t - \tau_{kp}] \tag{2} $$

where $a_{kp}$ are the complex multipath amplitudes. Notice that $a_{kp} = a_{lp}$ if $k$ and $l$ are two
virtual users corresponding to the same physical user. This is due to the fact that the
signal waveforms of all virtual users corresponding to the same physical user pass
through the same channel. For multiple antennas $a_{kp}$ is a vector. For dual antennas, for
example, primary and diversity,



$$ a_{kq} = \begin{bmatrix} a_{p,kq} \\ a_{d,kq} \end{bmatrix} \tag{3} $$

The waveform $s_k[t]$ is referred to as the signature waveform for the $k$th virtual user. This
waveform is generated by passing the spreading code sequence $c_k[n]$ through a pulse-shaping
filter $g[t]$

$$ s_k[t] = \sum_{r=0}^{N-1} g[t - rN_c]\, c_k[r] \tag{4} $$

where $N = 256$ and $g[t]$ is the raised-cosine pulse shape. Since $g[t]$ is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the received signal $r[t]$ represents the
baseband signal after filtering by the matched chip filter. Note that for spreading factors
less than 256 some of the chips $c_k[r]$ are zero.



Combining Equations (1) through (4) gives

$$ r[t] = \sum_{k=1}^{K_v} \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_k[t - mT - \tau_{kp}]\, b_k[m] + w[t] \tag{5} $$

The output of the despreading operation for a single multipath component is the complex
statistic
$$ y_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n], \qquad w_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} w[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n] \tag{6} $$

where $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, and $N_l$ is the (non-zero) length of code $c_l[n]$. The values
$y_{lq}[m]$ are complex and are referred to as the pre-MRC matched-filter outputs. For multiple
antennas, $r[t]$, $w[t]$, $y_{lq}[m]$ and $w_{lq}[m]$ are column vectors.
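
As an illustration of the despreading step, the following sketch correlates the received samples against the conjugate of one user's code for a single path and a single antenna, with the 1/(2N) normalization. The cplx type, the function name and the argument layout are assumptions made for the example, not part of the report.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cplx;

/* Despread one multipath component of one symbol:
 *   y = 1/(2*n_chips) * sum_n r[n*nc + tau + m*T] * conj(c[n])
 * r        received baseband samples
 * c        spreading code chips, length n_chips
 * nc       samples per chip
 * tau      estimated path delay in samples
 * m        symbol index; the symbol period is T = n_chips * nc samples
 * The caller must supply at least (m + 1) * n_chips * nc + tau samples. */
cplx despread(const cplx *r, const cplx *c,
              size_t n_chips, size_t nc, size_t tau, size_t m)
{
    assert(n_chips > 0 && nc > 0);
    const size_t t_sym = n_chips * nc;
    cplx acc = { 0.0, 0.0 };
    for (size_t n = 0; n < n_chips; n++) {
        const cplx x = r[n * nc + tau + m * t_sym];
        /* x * conj(c[n]) */
        acc.re += x.re * c[n].re + x.im * c[n].im;
        acc.im += x.im * c[n].re - x.re * c[n].im;
    }
    acc.re /= 2.0 * (double)n_chips;
    acc.im /= 2.0 * (double)n_chips;
    return acc;
}
```

For chips of the form ±1 ± j each chip has squared magnitude 2, so a noise-free input r[n] = a * c[n] despreads to exactly a; this is the reason for the factor of 2 in the normalization.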



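The matched-filter output is then

$$ y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{H} \cdot y_{lq}[m] \right\} = \sum_{k=1}^{K_v} \sum_{m'} \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{H} \cdot a_{kp}\, C_{lkqp}[m'] \right\} b_k[m - m'] + w_l[m] \tag{7} $$

$$ w_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{H} \cdot w_{lq}[m] \right\} $$

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$ and $w_l[m]$ is the match-filtered receiver noise. The terms
for $m' \neq 0$ result from asynchronous users.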




3. Effect of Amplitude Estimation Error on MUD Efficiency

MUD efficiency is defined in terms of the ratio of the intra-cell interference with MUD ($I_{MUD}$)
to the intra-cell interference with the Matched Filter (MF), that is, the intra-cell interference
without MUD ($I_{MF}$):

$$ \beta_{MUD} \equiv 1 - \frac{I_{MUD}}{I_{MF}} \tag{8} $$

The total interference without MUD is $I_{MF} + J$, where $J$ is the inter-cell interference.
Similarly, the total interference with MUD is $I_{MUD} + J$. The ratio of inter-cell interference to
intra-cell interference without MUD is denoted $f = J/I_{MF}$. The increase in system capacity is
equal to the ratio of the total interference without MUD to the total interference with MUD,
which is $(I_{MF} + J)/(I_{MUD} + J) = (I_{MF} + f I_{MF})/(I_{MUD} + f I_{MF}) = (1 + f)/(1 - \beta_{MUD} + f)$. For $f = 0.3$ and
$\beta_{MUD} = 0.7$, MUD increases the system capacity by a factor of 1.3/(1 - 0.7 + 0.3) = 2.2.
Hence, if our goal is to double system capacity the MUD efficiency must be approximately
70% or greater.
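
The capacity relation above, gain = (1 + f)/(1 - β_MUD + f), is easy to evaluate and to invert for a target gain. The sketch below does both; the function names are illustrative.

```c
#include <assert.h>

/* Capacity gain from MUD with efficiency beta_mud when the ratio of
 * inter-cell to intra-cell interference (without MUD) is f. */
double capacity_gain(double beta_mud, double f)
{
    assert(1.0 - beta_mud + f > 0.0);
    return (1.0 + f) / (1.0 - beta_mud + f);
}

/* MUD efficiency needed to reach a target capacity gain, obtained by
 * solving the relation above for beta_mud. */
double required_efficiency(double gain, double f)
{
    assert(gain > 0.0);
    return 1.0 + f - (1.0 + f) / gain;
}
```

capacity_gain(0.7, 0.3) is about 2.17, and required_efficiency(2.0, 0.3) is 0.65, consistent with the statement that roughly 70% efficiency is needed to double capacity.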

In the following we estimate the loss in MUD efficiency, $1 - \beta_{MUD}$, due to imperfect channel
estimation. For simplicity of presentation we consider approximately synchronous users.
Recall that in a synchronous system the matched-filter outputs can be expressed as

Ill 

and that the intra-cell interference is then



rij 



The effect of channel amplitude errors is that the estimates of the R-matrix elements ($\hat{r}_{lk}$)
are imperfect, which reduces the interference that is cancelled. When MUD is employed
with imperfect R-matrix estimates the detection statistic is



3^/ - Y^^iA = 

K K 



where for the present case we have assumed that the bit estimates are perfect. With 
MUD the intra-cell interference is 
Now from Equation (7), specialized for synchronous users



2 V=] p=l 

B = T X X l^/^ ^^P ■ ^Ikqp + * ^ikqp J 

^kp = ^Jt/? ~ ^/fcp 

m 

Hence the second-order statistics are 
f| =-A i^^^^C^,, •£«a,,.C,;.,. +£^a^C,;^ Kek^C,^-,] 

IIP ^9.p=I^\p'=l 



=7^ i^fe^^ ■<% 

•[2 + 2Re(p„p;)] 



=-l-A^E^2•[l+|pr] 



27V 
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where we have assumed that the amplitude error is independent of the amplitude and we 
have used 



m 
a 

m 
y 

W 

"ftSfr- 

m 



(15) 



The second expression is discussed below. We refer to $E_k$ as the error amplitude for the
$k$th virtual user. The residual interference after MUD IC is



l2 



= ^liK-V)aEl+KEl\2i+\p\'] 



(16) 



2N, 



tz + ^/fe-2.t+|p|^] 



where all data channels have amplitude $A$. The error amplitude for the control channels is
denoted $E_c$ and the error amplitude for the data channels is denoted $E_d$. All data channel
amplitudes are determined by scaling the corresponding control channel amplitudes by
$1/\beta_c$. Hence $E_d = E_c/\beta_c$.



Similarly we can show that 

so that the matched-filter Interference is 

■I 

- iK -l)ccA^ + KfilA" ]• 2 • [l+ 1 p p ] 



(17) 



A" 



2N, 
2N, 



(18) 



Finally, the MUD efficiency is 



$$ \beta_{MUD} = 1 - \frac{I_{MUD}}{I_{MF}} = 1 - \left(\frac{E_d}{A}\right)^2 \tag{19} $$
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4. Conventional Channel Estimation 



The conventional channel amplitude estimate is given by 



1^ 



m 



J M 



L -I M 1 A/ 

= X£«*. •ZC,,^N']iX^'/[m] ^^Jm-ml +— XH'J'n] f',[m] (20) 

Jt=l p=l m' ^ m=l ^ m=l 

L 

4=1 p=l 



where 



m' 

f'^'l ^ ^X^/ ['^l 'h\m-m'] (21 ) 



2 M 



In the above $b_l[m]$ represent the known pilot bits. (The $l$th virtual user is implicitly a control
channel.) The number $M$ represents the number of pilot bits used to derive the channel
amplitude estimates. The channel amplitude estimate can be rewritten



k=\ p=i 



L L 
p^q k*l 



It Is shown in the appendix that 

E\ti.., -H-A^^ =5„.5^.5^.5^.e{h^^ (23) 



Uqkp rq'k'p' Iig^i^ If kk' qq' pp' I Iqkp I Jjq^jcp 

Hence the variance of the estimate is 
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^k/, • }= i ^{ I " }• ^k/. • « ;./p } 



%0 



+ 



K L 



K, L 



L K L 



p=l k==l p=l 

p*q k^l 



The factor pe simply reflects the fact that the off-diagonal elements are smaller than the 
O diagonal elements due to partial correlations pkp between the antenna elements. In the 

0 Appendix it is also shown that 



3 4"*rL=^ 



(25) 



W Now combining Equations (24) and (25) gives for the variance of the channel amplitude 

^^ estimate 



K, L 



p^q k^l 

1 ^ 1 i 



p^q k^l 



1 L-1 1 



N, L • MN,t:{ 



where we have used >4/ = Af/L The first term represents the variance due to a user's 
own multipath interference. This term is small compared to the variance arising from the 
total multiple-access interference. For simplicity we incorporate part of this term into the 
second term and drop the remainder. The final term represents thermal noise and other-cell
interference. For now we assume that thermal noise is small. The interference arising
from other cells is assumed to be proportional to the same-cell interference, with a
constant of proportionality $f = 0.35$. With these assumptions we have
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m 
y 

y 
I- 



(27) 



Notice that the magnitude of the error $E_l$ is approximately the same for all users. Also, the
$l$th user is implicitly a control channel, and hence $N_l = PG = 256$. If the virtual users
are all at the highest spreading factor, then in terms of the $K = K_v/2$ physical users we
have

$$ E_c^2 = (1 + f)\,\frac{KL}{M\,PG}\left[\beta_c^2 + \alpha\right] A^2 \tag{28} $$

where $E_c$ is the magnitude of the channel amplitude error for a control channel, $\beta_c$ is the
relative control channel amplitude, $A$ is the amplitude for the data channels, and where $\alpha$
is the activity factor for the data channels. Since the channel amplitudes for the data
channels are determined by scaling the amplitude of the corresponding control channel it
is evident that $E_d = E_c/\beta_c$. Hence,




= (1 + /) 



KL 



MPG 



1 + 



a 



(29) 



Given the parameters

    f          = 0.35
    K          = 128
    L          = 4
    M          = 18
    PG         = 256
    $\alpha$   = 0.4
    $\beta_c$  = 0.7333



we get 



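$$\frac{E_d}{A} = \sqrt{(1+f)\,\frac{KL}{M\,PG}\left[1 + \frac{\alpha}{\beta_c^2}\right]} = \sqrt{(1+0.35)\,\frac{(128)(4)}{(18)(256)}\left[1 + \frac{0.40}{(0.7333)^2}\right]} = 0.51 \qquad (30)$$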

The number of pilot bits, M, is taken to be 18, which represents 6 bits per slot, with the
amplitudes averaged over 3 slots. The corresponding MUD efficiency is

$$\beta_{MUD} = 1 - \left(\frac{E_d}{A}\right)^2 = 1 - (0.51)^2 = 0.74 \qquad (31)$$
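The arithmetic behind Equations (30) and (31) can be checked with a short sketch. The function and argument names below are illustrative choices, not taken from the report; the formula is the one reconstructed from Equations (29) to (31) and the parameter list above.

```python
import math


def conventional_relative_error(f, K, L, M, PG, alpha, beta_c):
    """Relative data-channel amplitude error E_d/A, following Equation (30).

    The estimate is formed on the control channel and scaled by 1/beta_c,
    which is why alpha is divided by beta_c**2.
    """
    variance = (1 + f) * (K * L) / (M * PG) * (1 + alpha / beta_c ** 2)
    return math.sqrt(variance)


def mud_efficiency(relative_error):
    """MUD efficiency 1 - (E_d/A)^2, following Equation (31)."""
    return 1 - relative_error ** 2


rel_err = conventional_relative_error(
    f=0.35, K=128, L=4, M=18, PG=256, alpha=0.4, beta_c=0.7333
)
print(f"E_d/A          = {rel_err:.2f}")
print(f"MUD efficiency = {mud_efficiency(rel_err):.2f}")
```

With the parameters listed above this gives approximately 0.51 and 0.74, matching the values quoted in the text.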
5. Improved Channel Amplitude Estimates

One method for significantly improving the channel amplitude estimates is to perform a
second estimate directly on the data channels after the initial data channel demodulation.
Performance is improved for two reasons. First, the entire slot can be used for integration.
Hence we have M = 3(10) = 30 bits. Secondly, the error is not scaled by $1/\beta_c$ since the
estimate is performed directly on the data channel. For this method we have



$$\frac{E_d}{A} = \sqrt{(1+0.35)\,\frac{(128)(4)}{(30)(256)}\left[(0.7333)^2 + 0.40\right]} = 0.29 \qquad (32)$$

and the corresponding MUD efficiency is

$$\beta_{MUD} = 1 - (0.29)^2 = 0.92 \qquad (33)$$
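As a hedged numerical check of Equations (32) and (33), the sketch below evaluates the error for the estimate made directly on the data channel. The function name is hypothetical; the only changes relative to the conventional case are M = 30 and the bracketed term, which is no longer divided by the square of the control-channel amplitude ratio.

```python
import math


def direct_relative_error(f, K, L, M, PG, alpha, beta_c):
    """Relative amplitude error E_d/A when estimating on the data channel.

    Follows Equation (32): the bracketed term is (beta_c^2 + alpha) because
    the estimate is not rescaled by 1/beta_c.
    """
    variance = (1 + f) * (K * L) / (M * PG) * (beta_c ** 2 + alpha)
    return math.sqrt(variance)


rel_err_direct = direct_relative_error(
    f=0.35, K=128, L=4, M=30, PG=256, alpha=0.4, beta_c=0.7333
)
efficiency_direct = 1 - rel_err_direct ** 2  # Equation (33)
print(f"E_d/A          = {rel_err_direct:.2f}")
print(f"MUD efficiency = {efficiency_direct:.2f}")
```

Under these assumptions the sketch gives approximately 0.29 and 0.92, in line with Equations (32) and (33).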



Slightly better performance can be achieved by using both data and control channels. 
This method can be performed either on the daughter card or on the modem card since it 
is a single user method. The assumption is that the matched-filter BER is sufficiently 
good. 

6. Multiuser Channel Amplitude Estimation 

Given the conventional channel estimates and the detected user bits it is possible to
subtract the MAI which corrupts channel estimation. This method of channel estimation is
referred to as multiuser channel estimation, as opposed to the conventional single-user
estimation techniques. A simple multiuser channel estimation technique is presented
below without analysis. Performance should be determined via simulation.

From Equation (22) the conventional estimate is 

^/.=X^/...*^^+^/. ^^^^ 

kp 

A multiuser estimate is obtained by subtracting the known interference among the channel
estimates
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kp*lq 



kp*lq 



(35) 



= a, — 



k* p'^lq ]q>^k" p' 



k'p'^lq 



where the (hopefully) improved multiuser channel estimate is denoted 5,^ . The first term 

above is the actual channel amplitude. The second term is the residual interference, and 
the last term represents thermal noise and other-cell interference, which is amplified by 
the multiuser interference subtraction. The extent of the amplification needs to be 
determined. 



7. Effect of Uncancelled Multipath Interference

It is expected that a typical RAKE receiver will be capable of tracking up to approximately
16 multipath components. Since the computational complexity of symbol-rate MUD is
quadratic in the number of multipaths L, it is unlikely that MUD implementations will be
able to cancel all multipath interference. The effect of uncancelled multipath is assessed
below.



Suppose that the RAKE receiver processes L' multipath components, but that the MUD
implementation cancels interference for L < L' components. From Equation (13) we have

^Ik '^/fc^P +^kp^lq -Ql^p} 

^ L L ^ ^ ^ L L' 



2 q=L+i ps=l 



^Ik ^-^^Y^^I^^K'^Ikqp-^K^lq -^Ikqp} 
^ qH p=l 



(36) 



^=1 ^ q=\ p^L+l 



+ 7 X YlK^kp -^Ikqp +^^^Iq'^JkqpS+7; S '^Ikqp + K% ' ^kpS 

and the variance is then 



2 ^=i+f p=L+J 
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p=i ^^"^l q=l p=L+\ 

2iV^ t3^p=L+l «=L+lp=l ^=L+lp=I,+l J 

S (37) 

^ Note that Ao^ is the ratio of the uncancelled to cancelled interference for the kth users. 

J«: Similarly, we have 

H £k}=^%^WA^ (38) 

Now, neglecting the second order terms jS^.A* and averaging over the users = E{fi^,,} 
we arrive at 
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2 


[l+lp 


r] 




2N, 




2 


6+1 p 


N 




2Ni 




2 


6+1 p 


r] 


2N, 



^ 2-6+1 pN j^2^^^^2)E2 +2j8,(a + /3,^M^} 



^0 



ry 



= 1-- 



(39) 



Note that $\beta_x$ is the ratio of the uncancelled to cancelled interference.

In order to assess typical values for $\beta_x$, multipath models [1][2][3] were used to generate
random profiles. The models are based on data collected in four areas (A, B, C, and D) in
the San Francisco-Oakland bay area. Table 1 below summarizes the key results. The
table shows $\beta_x$ versus the number of multipath components L.

Table 1. Ratio ($\beta_x$) of the uncancelled to cancelled interference.

          L = 8    L = 6    L = 4    L = 3    L = 2    L = 1
Area A    0.0019   0.0064   0.0481   0.0961   0.2376   0.5819
Area B    0.0012   0.0086   0.0404   0.1115   0.1416   0.5749
Area C    0.0004   0.0054   0.0291   0.0948   0.1649   0.6603
Area D    0.0039   0.0128   0.0430   0.0629   0.1435   0.4890




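Suppose $\beta_x = 0.05$ and $(E_d/A)^2 = 0.51^2 = 0.260$. Without taking uncancelled multipath into
account we found $\beta_{MUD} = 0.74$. Taking uncancelled multipath into account we find

$$\beta_{MUD} = 1 - \frac{1}{1+2\beta_x}\left[\left(\frac{E_d}{A}\right)^2 + 2\beta_x\right] = 1 - \frac{1}{1+2(0.05)}\left[(0.51)^2 + 2(0.05)\right] = 0.67 \qquad (40)$$

where a worst-case $\beta_x = 0.05$ is used.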
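The effect described by Equation (40) can be illustrated numerically. This is a minimal sketch with a hypothetical function name; it takes the squared relative error from Equation (30) and the ratio of uncancelled to cancelled interference as inputs, and reduces to Equation (31) when that ratio is zero.

```python
def mud_efficiency_with_uncancelled(rel_error_sq, beta_x):
    """MUD efficiency including uncancelled multipath, following Equation (40).

    rel_error_sq is (E_d/A)^2 and beta_x is the ratio of uncancelled to
    cancelled interference. With beta_x = 0 this is just 1 - (E_d/A)^2.
    """
    return 1 - (rel_error_sq + 2 * beta_x) / (1 + 2 * beta_x)


rel_error_sq = 0.51 ** 2  # from Equation (30)
eff_all_cancelled = mud_efficiency_with_uncancelled(rel_error_sq, beta_x=0.0)
eff_worst_case = mud_efficiency_with_uncancelled(rel_error_sq, beta_x=0.05)
print(f"all multipath cancelled: {eff_all_cancelled:.2f}")
print(f"beta_x = 0.05:           {eff_worst_case:.2f}")
```

The two printed values are approximately 0.74 and 0.67, the figures quoted in the text for the cases without and with uncancelled multipath.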
8. Improved MUD Efficiency Due to Dropping Small Amplitudes



If small amplitude multipath components are not included in the cancellation the MUD
efficiency is reduced slightly due to the additional uncancelled multipath interference, but
it is also increased because of the absence of the error resulting from the inclusion of these
small noisy estimates. The net effect is a substantial increase in the MUD efficiency. From
Equation (30) we have



$$\frac{E_{d1}^2}{A^2} = (1+0.35)\,\frac{(128)}{(18)(256)}\left[1 + \frac{0.40}{(0.7333)^2}\right] = 0.065 \qquad (41)$$



where $E_{d1}$ is the error due to a single multipath (i.e. L = 1). From Equation (37) it is
evident that if a particular multipath amplitude satisfies $A_{kp} < E_{d1}$ then it is advantageous
not to incorporate this amplitude into the cancellation since the error is greater than the
amplitude. Table 2 shows the mean number of paths E{L} which satisfy $A_{kp} > E_{d1}$ and the
ratio $\beta_x$ of the uncancelled to cancelled interference if only these multipaths are cancelled.
The MUD efficiency is then calculated using



$$\beta_{MUD} = 1 - \frac{1}{1+2\beta_x}\left[E\{L\}\,\frac{E_{d1}^2}{A^2} + 2\beta_x\right] \qquad (42)$$



Table 2. Improved MUD efficiency ($\beta_{MUD}$) due to dropping small amplitudes.

          E{L}     $\beta_x$   $\beta_{MUD}$
Area A    2.0300   0.0714      0.7638
Area B    2.4660   0.0691      0.7482
Area C    2.2970   0.0680      0.7564
Area D    2.0690   0.0625      0.7748
Mean      2.2155   0.0678      0.7608




9. Conclusions 

This report represents a first look at channel estimation and the effect of errors on the
MUD efficiency. Only the case where all users are at the highest spreading factor has
been examined. The initial results indicate that if the conventional channel estimates are
used the MUD efficiency drops to 74% due to estimation errors. If the effect of
uncancelled multipath interference is also considered the MUD efficiency drops to
67%. If small amplitude multipath components are not included in the cancellation the
MUD efficiency is reduced slightly due to the additional uncancelled multipath
interference, but it is also increased because of the absence of the error resulting from the
inclusion of these small noisy estimates. The net effect is a substantial increase in the
MUD efficiency, which is increased to 76%. The actual MUD efficiency will, of course, be
less due to other factors which degrade efficiency. If an improved single-user channel
estimation is used the MUD efficiency can be increased to 92%. This improved method
requires knowledge of the pre-MRC matched-filter outputs. It is perhaps possible to
further increase the MUD efficiency by employing multiuser channel estimation. These
techniques also require knowledge of the pre-MRC matched-filter outputs. The
above-referenced MUD efficiency numbers are based on 128 users processed by the
basestation. If fewer users are allowed access to the system in order to increase range
the MUD efficiency is unchanged since the total interference and noise remains
unchanged.
Appendix A



In order to estimate the variance of the channel amplitude estimate we need the second 
order statistics 



m m' 



1 N \m \ 

m m' ^ I 



where we have used 



(A2) 



which is derived in Appendix B assuming random codes. In order to evaluate JE{/;^[m']} 
we consider two cases: 1) * = / , and 2) ki^i. For k = l yNe have 
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1 f M M 1 

M AU^' ]}= T7T ^1 2 £ tm] • ^7 Jn] . 6, [m - m1 • [« - m' ] 

M «=i J 



t MM 



whereas for ^ ^ / we have 



5 . M rn^l „=i 

^0 Hence, combining Equations (A3) and (A4) we have 



Equation (A1) then becomes 



(A3) 



i MM 



£{/lN']}=5« •5..o(^l-^j+^ (A5) 



= iW.A4^{^«-^.o(l-^)^} (A6) 
« tt „ pp I « 1^ M J MJ 

Now specializing Equation (A6) to the case where k = l

» '«'/> ' Vp jv, [ iV, M J M j ^ ' 

The above expression is further simplified if we assume that users are approximately
synchronous so that $N_{llqp}[0] \approx N_l$, which gives

EiH,J%^.l- (A8) 
Similarly, specializing Equation (A6) to the case where $k \neq l$



EiH^n.-^ (A9) 



in 



Appendix B 

In Appendix A we used the approximation 

E^^lm].c;,,,lmlh^SA^^^^^ (B1) 

under the restriction that $lq \neq kp$. We show here that this expression is exactly true for
chip-synchronous users, and that the approximation is reasonably valid for
chip-asynchronous users, particularly when differences in delay lag are greater than about 2
chips. The analysis is based on random user codes.

The user correlations can be explicitly related to the code correlations as follows 



W =C«[T«^[m]] 

k (B2) 



ZlVl i J 



C5 ^lkqp[^]^^T+Ti^-Tj^ 

m 

Consider two cases: 1) $l \neq k$, and 2) $l = k$.

Case 1

When $l \neq k$ the second-order statistics become



£{cjT]C;iT']}=— 

= — - — y e -[T]- e....LT']-2d„.8.. ■2d,^.d ... 

^^l if 
where we have used the assumption of random user codes, independent among the
users. Note also that the summation over i is over the range where $c_l[i]$ is non-zero, and
similarly the summation over j is over the range where $c_k[j]$ is non-zero.

Case 2

Now consider case 2 where $l = k$



w 



When / 5t /' we have 



£{c„[T] • c,;.[T']}= -ly X sij m- grji^i- E^:m ■ c^n- c, in-ciu^ 

=^Ig,[T]g,v[T']-£{:;[/]-c,[/']c,[7]c;[/]} 



=^|x^(,M«iytf']'Efc*mc,[nc,[y]c;[/]} 

'-;VT J 
+ X^.t^l- gr/[T']-2£{:,[i']c;[/]}[ 



(B4) 



f3 = — y ?,.rT]-s...,[T'l-25.. -25..., 

|3 =5„.g[T]-g[T'] 

whereas when / = /'we have 



(B6a) 



(B6b) 
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ftS' 



In 



less:' 

u 



4n; 



i 



+ (l-Sj^!^J^gy[T]g,[T'] 



(B6c) 



+ J,g,m-grjlrl-2-2S,j. 

= 5„,g[T]g[T'] + -|f|x5o[-r]g.jtT']-iV,g[T]g[T']| 

Hence combining Equations (B5) and (B6c) we have 

E^,,[T]C;.,\r']}=5,,.g[r]g[r]+^^^^ (87) 
and combining the cases for $l \neq k$ and $l = k$ we have



= 5„ ■S,,.g[T]- g[T'3+ ^" "^y^"' (B8) 

= S„ ■S,,g[t]-g[r]- ^" g[T]-g[T']+^^Xg>.tT] g,[r'l 

The above expression can be used to determine the second-order statistics for the 
general case of symbol-asynchronous and chip-asynchronous users with arbitrary 
spreading factors. In what follows we will be interested in approximating the above 
expression so as to get simple but meaningful results. In order to simplify the expressions 
we consider users all at the highest spreading factor, and we assume that certain small 
values are zero. 

To assess the accuracy of channel estimation we need to determine the second order 
statistics 

£^i.Jm] ' C;v,.[m']}= £{C jT;,^,[m]] - C;,.[T,,.,.,.[m']]} ^^^^ 
with $lq \neq kp$. The function $g[\tau]\,g[\tau']$ in Equation (B8) above is small unless both $\tau$ and $\tau'$ are
close to zero, and for the chip-synchronous case the function is exactly zero unless
both $\tau$ and $\tau'$ are equal to zero. Since for $lq \neq kp$ the probability that $\tau_{lkqp}[m]$ is close to
zero is small, a good approximation is to assume that these functions are zero. The third
term can be written



jfc[T].c;jT']}=M-|_L2;g..[T].g,[T'] 



(BIO) 



The double summation in the brackets 



5«[T.tr'] = -i-Xg,[T]-g,[T'] 

1} 



(B11) 



is plotted in Figure B1 for $N_l = N_k = 256$ versus $\tau - \tau'$ for $(\tau + \tau')/2 = 0$.

Figure B1. Plot of $S[\tau, \tau']$ for $N_l = N_k = 256$
versus $\tau - \tau'$ for $(\tau + \tau')/2 = 0$.

The sharp localization around $\tau - \tau' = 0$ is valid for all values of $(\tau + \tau')/2$, except that for
$(\tau + \tau')/2$ large the peak value drops off due to the partial overlap of the codes. Hence for
delay lag differences $\tau - \tau'$ greater than about 2 chips a good approximation is



(B12) 



This approximation then gives



(B13) 
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which implies 






4c«^['«l c;.,.,.[m']}=-l-5„. S^. -S,,. -S^. -5^. ■S^[r,r] (B14) 
provided the delay spread is less than a symbol period. Now it can be shown that 



where Nikqp[m'] is the overlap between the user codes. Our final result is then 



(B15) 



(B16) 
1. Multi-User Signal Model

The Rake receiver operation described in the next section is based on a signal model. The
MUD algorithm and implementation are based on the same model. This model is
described below.

Figure 1 shows the uplink complex spreading for the Dedicated Physical Data
CHannels (DPDCHs) and the Dedicated Physical Control CHannel (DPCCH). There can
be from 1 to 6 DPDCHs, denoted DPDCH_k, for k from 1 to 6. If there is more than one
DPDCH, then the spreading factor for all DPDCHs must be equal to 4. For a single
DPDCH (DPDCH_1) the spreading factor can vary from 4 to 256. The data bits for channel
DPDCH_1 are spread by channelization code $c_{d,1} = C_{ch,SF,SF/4}$, where SF is the DPDCH
spreading factor. These channelization codes are referred to as Orthogonal Variable
Spreading Factor (OVSF) codes. They are equivalent to Hadamard codes, except for their
ordering. When there are multiple DPDCHs, the dedicated channels DPDCH_k, for k from
1 to 6, are spread by channelization codes $c_{d,k} = C_{ch,4,n}$, where the relationship between n
and k is represented in Table 1.

Table 1. Relationship between n and k.

    n    k
    1    1, 2
    3    3, 4
    2    5, 6




The data bits for the DPCCH are spread by code $c_c = C_{ch,256,0}$. The spreading factor for the
DPCCH is always equal to 256. The multipliers $\beta_c$ and $\beta_d$ are constants used to select the
relative amplitudes of the control and data channels. At least one of these constants must
be equal to 1 for any given symbol period m.
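To make the channelization-code notation concrete, the sketch below generates $C_{ch,SF,k}$ using the usual OVSF code-tree recursion, in which each code of length n spawns the children [c, c] and [c, -c]. That recursion and the function name `ovsf_code` are assumptions of this example; the text above only states that the codes are OVSF codes equivalent to reordered Hadamard codes.

```python
def ovsf_code(sf, k):
    """Return the OVSF code C_ch,sf,k as a list of +1/-1 chips.

    Assumed tree recursion: C_{2n,2k} = [C_{n,k}, C_{n,k}] and
    C_{2n,2k+1} = [C_{n,k}, -C_{n,k}], starting from C_{1,0} = [1].
    """
    if sf < 1 or sf & (sf - 1):
        raise ValueError("spreading factor must be a power of two")
    if not 0 <= k < sf:
        raise ValueError("code index must satisfy 0 <= k < sf")
    code = [1]
    depth = sf.bit_length() - 1
    # Walk down the tree; the most significant bit of k picks the first branch.
    for bit in range(depth - 1, -1, -1):
        if (k >> bit) & 1:
            code = code + [-c for c in code]
        else:
            code = code + code
    return code


def dot(a, b):
    """Inner product of two equal-length chip sequences."""
    return sum(x * y for x, y in zip(a, b))


dpcch_code = ovsf_code(256, 0)          # c_c = C_ch,256,0
dpdch1_code = ovsf_code(64, 64 // 4)    # c_d,1 = C_ch,SF,SF/4 with SF = 64
print(ovsf_code(4, 1))
```

Codes of the same spreading factor produced this way are mutually orthogonal, which is the property the text relies on when it treats them as reordered Hadamard codes.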



Figure 1. Uplink complex spreading of DPDCHs and DPCCH (chip rate 3.84 Mcps).

The uplink spreading for any one of the seven Dedicated CHannels (DCHs) above can be 
represented as shown in Figure 2. 



Figure 2. A second representation of the uplink spreading
for any one of the seven Dedicated CHannels (DCHs).



where the code c[n] is given by



c[n] = 



'cA.256,0 



DPCCH 
DPDCH, 



Cc.256,min]S^^[n], DPDCH 

Cc*.256.i92[n]A,[«], DPDCH 

CcH,256,m[n]SM DPDCH 

Cc,,^6,mln]- jSM DPDCH 



DPDCH, 



(1) 
and



$$\beta = \begin{cases} \beta_c, & \text{DPCCH} \\ \beta_d, & \text{DPDCH}_{1\text{-}6} \end{cases} \qquad (2)$$



For a DCH with a spreading factor less than 256 there are J = 256/SF data bits 
transmitted during a single 256-chip symbol period (i.e. 1/15 ms). From a signal model 
perspective, the J data bits transmitted per symbol period can be viewed as arising from J 
virtual users, each transmitting a single bit per symbol period. The idea is illustrated in 
Figure 3. 



Figure 3. Transforming a single user with bit rate J bits per symbol period
into J virtual users, each with bit rate 1 bit per symbol period. The virtual-user
bits are $b_j[m] \equiv b[j + mJ]$ for $j = 0, \dots, J-1$.

The codes for these virtual users are formed by extracting SF elements at a time out of 
the DCH code sequence to form J new codes. Each of the J codes is of length 256 chips, 
but with only SF non-zero chips. That is, 



$$c_j[n] = \begin{cases} c[n], & j\,SF \le n < (j+1)\,SF \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$



This code-partitioning concept is illustrated in Figure 4 for the case SF = 64 so that J = 
256/SF = 4 codes are derived from the one DCH code. 



Figure 4. Code partitioning concept illustrated for the case SF = 64, 
whereby J = 256/SF = 4 codes are derived from a single DCH code. 
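The code-partitioning idea can be sketched in a few lines. The helper name `partition_code` and the example chip sequence are hypothetical; the sketch only assumes that each virtual-user code keeps SF consecutive chips of the 256-chip DCH sequence and is zero elsewhere, as described above.

```python
def partition_code(code, sf):
    """Split one DCH code into J = len(code)/sf virtual-user codes.

    Virtual user j keeps chips j*sf .. (j+1)*sf - 1 and is zero elsewhere,
    so every returned code has the full length but only sf non-zero chips.
    """
    n = len(code)
    if n % sf:
        raise ValueError("code length must be a multiple of the spreading factor")
    virtual_codes = []
    for j in range(n // sf):
        chips = [0] * n
        chips[j * sf:(j + 1) * sf] = code[j * sf:(j + 1) * sf]
        virtual_codes.append(chips)
    return virtual_codes


# Hypothetical 256-chip DCH sequence used only for illustration.
dch_code = [1 if (n // 3) % 2 == 0 else -1 for n in range(256)]
virtual = partition_code(dch_code, sf=64)
print(len(virtual), "virtual-user codes,",
      sum(1 for c in virtual[0] if c != 0), "non-zero chips each")
```

Because the non-zero segments do not overlap, summing the J virtual-user codes chip by chip recovers the original DCH code.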



The control channel can also be viewed as a virtual user. Hence, for a given physical user
with spreading factor SF there are $1 + 256\,N_d/SF$ virtual users, where $N_d$ is the number of
DPDCHs. (Recall that for $N_d > 1$, SF = 4.)
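A small sketch of the virtual-user count follows. The function name is illustrative; the formula is the one stated above, with the constraint that multiple DPDCHs imply a spreading factor of 4.

```python
def virtual_user_count(sf, n_dpdch=1, n_chips=256):
    """Virtual users for one physical user: 1 control channel + 256*N_d/SF."""
    if n_chips % sf:
        raise ValueError("spreading factor must divide the symbol length")
    if n_dpdch > 1 and sf != 4:
        raise ValueError("multiple DPDCHs require spreading factor 4")
    return 1 + n_dpdch * (n_chips // sf)


print(virtual_user_count(sf=256))            # one DPDCH at the highest SF
print(virtual_user_count(sf=64))             # one DPDCH at SF = 64
print(virtual_user_count(sf=4, n_dpdch=6))   # six DPDCHs at SF = 4
```

At the highest spreading factor each physical user contributes two virtual users, so 128 physical users correspond to 256 virtual users.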



It turns out to be convenient to use a double indexing scheme to identify virtual users. Let
paired indices kj represent the jth virtual user associated with the kth dedicated channel.
Index j varies from 0 <= j < $J_k$ = 256/$SF_k$, where $SF_k$ is the spreading factor for the kth
dedicated channel. For the remainder of this section the spreading factors $SF_k$ are
assumed to be constant. In section 3 the equations are reformulated to allow for
symbol-by-symbol changes in the spreading factor.

The transmitted signal for virtual user kj can be written

$$x_{kj}[t] = \beta_k \sum_m v_{kj}[t - mT]\, b_{kj}[m] \qquad (4)$$

where t is the integer time sample index, $T = N N_c$ is the data bit duration, N = 256 is the
short-code length, $N_c$ is the number of samples per chip, $b_{kj}[m]$ are the data bits, and
where $v_{kj}[t]$ is the transmit signature waveform for virtual user kj. This waveform is
generated by passing the spread code sequence $c_{kj}[n]$ through a root-raised-cosine
pulse-shaping filter h[t]

$$v_{kj}[t] = \sum_{p=0}^{N-1} h[t - pN_c]\, c_{kj}[p] \qquad (5)$$

Note that $\beta_k = \beta_c$ if the kjth virtual user corresponds to a control channel. Otherwise
$\beta_k = \beta_d$. The total number of virtual users is denoted

$$K_v = \sum_{k=1}^{K_d} J_k \qquad (6)$$

where $K_d$ is the total number of dedicated channels. The baseband received signal after
root-raised-cosine matched-filtering can be written

$$r[t] = \sum_{k=1}^{K_d} \sum_m \sum_{j=0}^{J_k-1} \tilde{s}_{kj}[t - mT]\, b_{kj}[m] + w[t] \qquad (7)$$

where w[t] is receiver noise with a raised-cosine power spectral density, and where $\tilde{s}_{kj}[t]$
is the channel-corrupted signature waveform for virtual user kj. For L multipath
components the channel-corrupted signature waveform for virtual user kj is modeled as

$$\tilde{s}_{kj}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kj}[t - \tau_{kp}] \qquad (8)$$

where $a_{kp}$ are the complex multipath amplitudes. The amplitude ratios $\beta_k$ are incorporated
into the amplitudes $a_{kp}$. Notice that if k and l are two dedicated channels corresponding to
the same physical user then, aside from scaling by $\beta_c$ and $\beta_d$, $a_{kp}$ and $a_{lp}$ are equal.
This is due to the fact that the signal waveforms of all virtual users corresponding to the
same physical user pass through the same channel. The waveform $s_{kj}[t]$ is referred to as
the signature waveform for the kjth virtual user. This waveform is generated by passing
the spread code sequence $c_{kj}[n]$ through a raised-cosine pulse-shaping filter g[t]

$$s_{kj}[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_{kj}[p] \qquad (9)$$

Note that for spreading factors less than 256 some of the chips $c_{kj}[p]$ are zero.

2. Rake Receiver Operation 

This section describes the operation of a typical Rake receiver. Figure 5 shows a
representation of the received antenna data that is delivered to the Rake receivers of all
users.

Figure 5. Received antenna data delivered to the Rake receivers of all users.

The figure shows the received signals corresponding to users l and k. These signals are
combined in free space so that the receiver gets one composite signal, which we denote
r[t]. The buffer length is assumed to be an integral number of frames in length so that
delay lag values $T_{lq}$ are approximately constant with each new filling of the buffer. For
each finger of each user there is a delay lag value $T_{lq}$ indicating the start of frame for the
qth multipath of the lth user. Lag values $T_{lq}$ are assumed to be constant over a frame, but
are allowed to change from frame to frame in response to the delay locked loop operation
and in response to new searcher-receiver sweeps where new delay lags are found. The
lower case values $\tau_{lq} = T_{lq} \bmod 256N_c$ denote the symbol-period offset relative to the start
of an internal symbol period reference clock. Notice that the user spreading factors
change on user frame boundaries. Since users are asynchronous it is impossible to have
a MUD processing frame that corresponds to all user frame boundaries. Hence the MUD
processing frame is matched as closely as possible to the user frame boundaries, but does
not necessarily correspond precisely to any user's frame boundary. Consequently there
will be spreading factor changes that occur during a MUD processing frame. Handling
these mid-frame changes is the subject of section 3 below.



The received signal above, which has been match-filtered to the chip pulse, must next be 
match-filtered by the user code-sequence filter. Since the spreading factor for the 
DPDCHs is not known, the Rake receiver performs an initial 4-chip despreading over all 
DPDCHs. The Fast Hadamard Transformation (FHT) can be used here to reduce the 
number of operations. The detection statistics for the multiple fingers and multiple 
antennas are maximal-ratio combined. Since the DPCCH is always spread with a 
spreading factor of 256 the DPCCH can be entirely despread during each symbol period. 
TFCI bits are extracted each slot from the DPCCH. After an entire frame is processed the 
TFCI is decoded and the spreading factor for that frame is determined. After spreading 
factor determination the final DPDCH despreading is performed. The resulting detection
statistics are denoted here as $y_{kj}[m]$, the matched-filter output for the kjth virtual user for
the mth symbol period. Since there are $K_v$ codes, there are $K_v$ such detection statistics,
which are collected into a column vector y[m] for the mth symbol period. The
matched-filter output $y_{li}[m]$ for the lith virtual user can be written



! n 



(10) 



where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, and $N_l$ is the (non-zero) length
of codes $c_{li}[n]$ (i.e., the spreading factor for the lth dedicated channel). The intermediate
result $y_{liq}[m]$ represents the despread signal at the qth lag, and is here referred to as the
pre-MRC matched-filter output. When multiple antennas are employed, r[t], $y_{liq}[m]$ and $\hat{a}_{lq}$ are
column vectors with one complex element per antenna.

The matched-filter detector estimates the transmitted data bits as $\hat{b}_{li}[m] = \mathrm{sign}\{y_{li}[m]\}$.
Multiuser detection is considered in the next section.

3. Multiuser Detection Equations and Asynchronous Processing

As shown in Figure 5 a MUD processing interval must necessarily be asynchronous with
most users' frame boundaries since the users are asynchronous. Because of this,
spreading factors will change during a MUD processing frame. When the spreading factor
changes during the processing frame the MUD equations are modified. These
modifications are considered in this section.

The modem delivers matched-filter data to the MUD function on a frame-by-frame basis.
Let $N_p[r]$ represent the number of physical users accessing the system during frame r. For
each frame the following data is received for physical users p = 1 to $N_p[r]$ and each
dedicated channel l

• Number of DPDCHs, $N_{d,p}$

• Spreading factor, $SF_l$

• Amplitude ratios $\beta_d$ and $\beta_c$

• Slot format

• Channel amplitude estimates $\hat{a}_{lq}$

• Channel lag estimates $\hat{\tau}_{lq}$

• Matched-filter outputs $f_{li}[m]$ for all DCHs

• Code numbers

• Gap information for compressed mode

Matched-filter outputs $f_{li}[m]$ correspond to the matched-filter outputs $y_{li}[m]$. If the lth
dedicated channel is a DPCCH then matched-filter outputs are only received for the TPC,
TFCI and FBI bits. The $f_{li}[m]$ values are mapped to the $y_{li}[m]$ values as described below.
The mapping accounts for the frame offsets between the various users. The amount of
matched-filter data received per physical user depends on the DPDCH spreading factor.

For each dedicated channel a symbol offset $m_l$ is determined according to
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iv (256 J 



(11) 



where div denotes integer division (i.e. with truncation). The symbol offset represents the
fact that the users and hence the frame data are asynchronous. The y-data used for
interference cancellation is derived from the frame data using

$$y_{li}[m] = f_{li}[m - m_l] \qquad (12)$$



Figure 6 shows an example mapping of user data frames to MUD processing frames. To
illustrate concepts the frames are each 16 symbol periods long rather than the actual 150
symbols for WCDMA. The height of the blocks represents the number of virtual users per
physical user. For physical users 1 and 4 the spreading factor changes in going from data
frame 1 to data frame 2. As shown in the figure this results in spreading factor changes
within the MUD processing frame. The MUD function is designed to calculate the
C-matrix once per frame. Hence mid-frame changes to user spreading factors pose a
problem which requires special treatment. It turns out, and will be shown below, that
mid-frame changes to the spreading factor can be accommodated by performing modified
calculations based on the minimum spreading factor over the MUD processing frame.



Figure 6. Mapping of user data frames to MUD processing frames.



First we develop the MUD matrix signal model which allows user spreading factors to 
change on a symbol-by-symbol basis. We then show how we can perform the processing 
based on the minimum user spreading factors over the MUD processing frame. 

Let us reformulate the signal model presented in section 1 so as to allow spreading
factors to change every symbol period. For every DCH k, there are $J_k[m]$ virtual users,
where index m is the symbol period index. The number of virtual users $J_k[m]$ is

$$J_k[m] = \frac{256}{SF_k[m]} \qquad (13)$$

where $SF_k[m]$ is the spreading factor for the kth dedicated channel during the mth symbol
period. The signature waveform for the jth virtual user of $J_k[m]$ total belonging to the kth
DCH over the mth symbol period can be written

$$s_{kj,m}[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_{kj,m}[p] \qquad (14)$$

where the codes and hence the signature waveforms now include the symbol-period index
m to account for symbol-by-symbol spreading factor changes. The channel-corrupted
signature waveform is then

$$\tilde{s}_{kj,m}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kj,m}[t - \tau_{kp}] \qquad (15)$$

and thus the received signal corresponding to $K_d$ dedicated channels is

$$r[t] = \sum_{k=1}^{K_d} \sum_m \sum_{j=0}^{J_k[m]-1} \tilde{s}_{kj,m}[t - mT]\, b_{kj}[m] + w[t] \qquad (16)$$

The MUD matrix signal model proceeds from substituting the received signal r[t] from
Equation (16) into Equation (10) for the matched-filter outputs

fS r 1 

S n k=l j=0 [q=l r J 

13 

z :: 

p (17) 



L L ^ 1 



9=1 p=\ 

Cukj,, [m. n] ^ ^^^^^^ ^^8[(r- s)N^ +(m-rt)T+T% -T^]C[r]-c,^,„[s] 

where $\eta_{li}[m]$ is the match-filtered receiver noise and $N_l[m] = SF_l[m]$. The terms for $m' \neq 0$
result from asynchronous users.

The delay lags $\tau_{lq}$ for a given DCH l will under most circumstances be grouped within a
range of from 4 to 8 µs. Under extreme conditions the delay spread will be as high as 20
µs. In any event, let $\tau_l$ represent the mean delay lag $\tau_{lq}$ over index q. According to
Equation (10) above, the matched-filter detection statistic $y_{li}[0]$ is the result found by
correlating the received signal starting roughly at delay lag $\tau_l$, where $\tau_l$ is approximately in
the range 0 to $256N_c$. If $\tau_l$ moves significantly outside this range an adjustment in the
symbol period alignment will need to be made to restore $\tau_l$ back to within the desired
range. More will be said about this below. Along the same lines, the detection statistic
$y_{li}[m]$ is the result found by correlating the received signal starting roughly at delay lag $\tau_l + mT$.



For efficient MUD processing it is important for the C-matrices to be constant over a 10
ms MUD processing frame. We now describe a method which operates on constant
C-matrices. Handling changes to user spreading factors is relegated to the IC portion of the
MUD processing. Let us define

$$J_k = \max_m J_k[m] \qquad (18)$$

where the maximization is over symbol periods m that contribute to the current MUD
processing frame. This includes not only symbol periods that fall within the MUD
processing frame, but in addition a few symbol periods on either side due to
asynchronous users. Note that the minimum spreading factor for the kth DCH is $SF_k$ =
256/$J_k$. Now define the DCH contraction factor for the mth symbol period as

$$C_k[m] = \frac{J_k}{J_k[m]} \qquad (19)$$

The DCH codes for a given symbol period can be expressed as a sum of the DCH codes
corresponding to the minimum spreading factor. For the kth DCH there are at most $J_k$
virtual users corresponding to the minimum spreading factor. Let the codes for these
users be denoted $c_{kj}[r]$, 0 <= j < $J_k$. The codes for the mth symbol period, where there
might be fewer virtual users, are denoted $c_{kj,m}[r]$, 0 <= j < $J_k[m]$, where

$$c_{kj,m}[r] = \sum_{j'=jC_k[m]}^{(j+1)C_k[m]-1} c_{kj'}[r] \qquad (20)$$
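The contraction-factor bookkeeping of Equations (18) to (20) can be sketched as follows. The function names and the example chip sequence are hypothetical, and the sketch assumes that the underlying 256-chip DCH sequence is the same for every spreading factor in the frame, so that only the partitioning into virtual users changes.

```python
def partition_code(code, sf):
    """Virtual-user codes: user j keeps chips j*sf .. (j+1)*sf - 1, zero elsewhere."""
    n = len(code)
    out = []
    for j in range(n // sf):
        chips = [0] * n
        chips[j * sf:(j + 1) * sf] = code[j * sf:(j + 1) * sf]
        out.append(chips)
    return out


def contraction_factors(sf_per_symbol, n_chips=256):
    """Return (J_k, [C_k[m]]) per Equations (18) and (19)."""
    j_per_symbol = [n_chips // sf for sf in sf_per_symbol]
    j_max = max(j_per_symbol)  # corresponds to the minimum spreading factor
    return j_max, [j_max // j for j in j_per_symbol]


def contract_codes(min_sf_codes, c):
    """Equation (20): sum each group of c consecutive minimum-SF codes."""
    return [
        [sum(chips) for chips in zip(*min_sf_codes[j * c:(j + 1) * c])]
        for j in range(len(min_sf_codes) // c)
    ]


# Hypothetical frame in which the spreading factor changes symbol by symbol.
sf_per_symbol = [64, 64, 128, 256]
j_k, c_k = contraction_factors(sf_per_symbol)
dch_code = [1 if (n // 5) % 2 == 0 else -1 for n in range(256)]
min_sf_codes = partition_code(dch_code, 256 // j_k)
codes_symbol_2 = contract_codes(min_sf_codes, c_k[2])
print("J_k =", j_k, "contraction factors =", c_k)
```

In this example the symbol period at spreading factor 128 has contraction factor 2, and summing pairs of minimum-spreading-factor codes gives the same two codes as partitioning the sequence directly at spreading factor 128.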



m 



With this result we are now able to represent the MUD signal model in terms of the C- 
matrix and R-matrix elements based on the codes corresponding to the minimum DCH 
spreading factors. The C-matrix in Equation () above becomes 

^ (/+l)C,[mH {j+iyC,[n]-l 

= ^T^IZg[('-^)^c+(m-n)T+f„-r^] i:e,,[.] 

(i+l)C/[m]-l (;+i)ca«]-i ^ 

= 1^7-, X I ^i:igK'-^)A^c+('«-«)7'+T,,-v]c;,[r].c,,W 

i^llfff-i r=/C;[m3 f^jC^ln} r s 

^v^imj i'=,.c,[m^ r=jCM 
where $N_l = \min_m N_l[m] = SF_l$. Similarly, the R-matrix becomes



m 



m 



L L ^ 1 



(/+l)Q[mH (j+l) CanH L L 



N^[m] r=iCi[m] r=jC,[n] q=l p^l 

'IX r,,,[m-«] (22) 



N^[m] r=;C;[m3 /=jCa«] 



^=1 p=i 

SO that the matched-filter outputs become 



n k=l J=0 



= SIX ^ ^ r,,,,[m-n]U [«]+^;,['n] 

„ jfc=l y=o l^/L'WJ r=/C,[m3 /=jCi[nl J 



(23) 



lij This last equation can be written 



n k=l 7=0 [^/L'WJ /WQtm] J 



(24) 



''V(['«] /•=.C,[m) 

y«-N]=XX X 1 X r,,^['"-«]kw 

= XX X 1 X ^ 
= X X X*"«vf'" - • 

where we have defined $b_{kj'}[n] = b_{kj}[n]$ for $jC_k[n] \le j' < (j+1)C_k[n]$. Equation (24) is expressed
entirely in terms of matrix elements corresponding to the minimum spreading factor for the
MUD processing frame.
1. Introduction

Multiuser Detection (MUD) is most often thought of as a technique to improve either 
capacity or coverage for the uplink. A few reasons why MUD is uplink-focussed are 

• Downlink MUD must be performed in the handsets, which are limited in processing 
power 

• Each handset is interested in only one signal 

• In the downlink users are separated by orthogonal codes 

However, there is typically a greater demand for capacity in the downlink. If MUD is only
applied in the uplink the imbalance is even greater. While in the downlink users are
separated by orthogonal codes, because of multipath there is still significant intra-cell
interference. Equalization has been suggested as a means of restoring orthogonality;
however, the computationally attractive linear equalization methods tend to amplify the
other-cell interference and noise.

A downlink MUD method with reduced complexity is described in the next section.
The Fast Hadamard Transform (FHT) is used to reduce complexity. The FHT is used in
both the forward (demodulation) and backward (regeneration) directions.



2. The Method 

The method proceeds according to the following steps:

• Receive amplitude and delay information from the searcher receiver

• Start with the largest multipath

• Multiply the received signal by the conjugate of the scrambling code (512 chips at a
time)

• Perform the FHT on the result (for multirate users, this is done in stages)

• Determine soft data estimates

• Set user-of-interest data symbols to zero

• Do the same for all multipaths

• Proceed until the end of the slot

• Estimate amplitudes and gain factors

• Diversity-combine the results and make hard decisions

• Use the hard decisions, gain estimates and FHT to reconstruct the chip sequence c[n] (with
user of interest nulled)

• Multiply c[n] by Csh[n] to form d[n] (with user of interest nulled)

• Use amplitude estimates, delay lag estimates (from the searcher) and the raised-cosine pulse
to construct the chip filter

• Pass d[n] (with user of interest nulled) through the chip filter to reconstruct the interference
signal

• Subtract the interference signal from the received signal

• Demodulate with a conventional RAKE receiver
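
The forward and backward use of the FHT in the steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not in the memo: a single multipath, unit gains, BPSK symbols, no chip filter, and codes taken in plain Sylvester-Hadamard order (3GPP OVSF numbering is a permutation of these rows). The function names `fht`, `hadamard` and `cancel_others` are hypothetical.

```python
import numpy as np

def fht(x):
    """Fast Hadamard transform (Sylvester ordering), log2(n) butterfly stages."""
    y = np.asarray(x, dtype=complex).copy()
    n = y.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        y = y.reshape(-1, 2, h)                 # blocks of 2h samples, split in halves
        y = np.stack((y[:, 0] + y[:, 1],        # sums go to the first half
                      y[:, 0] - y[:, 1]),       # differences to the second half
                     axis=1).reshape(n)
        h *= 2
    return y

def hadamard(n):
    """Dense Sylvester Hadamard matrix, used here only to build test data."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.kron(np.array([[1, 1], [1, -1]]), H)
    return H

def cancel_others(rx, scramble, user):
    """Regenerate every code channel except `user` and subtract it from rx."""
    n = rx.size
    soft = fht(rx * np.conj(scramble)) / n      # forward FHT: despread all codes
    hard = np.sign(soft.real)                   # hard decisions (BPSK assumed)
    hard[user] = 0.0                            # null the user of interest
    interference = scramble * fht(hard)         # backward FHT: regenerate chips
    return rx - interference

rng = np.random.default_rng(0)
n = 8                                           # 512 in the memo; 8 keeps the demo small
symbols = rng.choice([-1.0, 1.0], size=n)
scramble = np.exp(1j * np.pi / 2 * rng.integers(0, 4, size=n))  # unit-modulus chips
rx = scramble * (hadamard(n) @ symbols)
cleaned = cancel_others(rx, scramble, user=3)
recovered = fht(cleaned * np.conj(scramble)) / n   # only user 3 remains
```

Because the Hadamard matrix is its own inverse up to a factor of n, the same routine serves for demodulation and for regeneration, which is what keeps the complexity low.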



The WCDMA transmitted signal can be represented as

$$s[t] = \sum_n g[t - nN_c]\, d[n]$$

$$d[n] = c[n]\, c_{sh}[n]$$

$$c[n] = \sum_{k=1}^{K} G_k\, b_k[n \,\mathrm{div}\, N_k]\, c_k[n]$$

where g[t] is the raised-cosine pulse (the chip-matched filter is artificially placed in the
transmitter for simplicity of presentation), $N_c$ is the number of samples per chip, and d[n] is
the composite chip sequence from all users. The received signal is then

$$r[t] = \sum_{q=1}^{L} a_q \sum_n g[t - \tau_q - nN_c]\, d[n]$$

The received signal advanced to the delay of interest is

[equation not legible in the source]

The received signal multiplied by the conjugate of the scrambling codes is

[equation not legible in the source]

This result can now be demultiplexed using the 512 x 512 FHT. Since 512 = 2^9, the FHT
proceeds in 9 stages. After the first two stages the SF 4 symbols can be extracted.
Similarly, after k stages the SF 2^k symbols can be extracted. The amplitudes can be
determined from the embedded pilot symbols, or searcher-receiver estimates can be
used. If embedded pilot symbols are used, the measurement M_pk of the pth multipath of
the kth user is of the form

[equation not legible in the source]

which includes the user gain factor. After measurements are taken for all multipaths and
all users for a given slot, the multipath amplitudes and user gains can be separated by
determining the dominant left and right singular vectors of the rank-1 matrix M_pk (aside
from an arbitrary scale factor which can be given to either the amplitudes or the gains). Once
the approximate amplitudes are known, the actual amplitudes a_p are determined by
inverting the diagonally dominant system of equations

[equation not legible in the source]
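
The separation of multipath amplitudes from user gains via the dominant singular vectors can be illustrated with a short sketch. It assumes, as the text suggests, that the measurement matrix is approximately an outer product of an amplitude vector and a gain vector. The sizes, noise level and the helper name `split_rank1` are illustrative assumptions, and the arbitrary scale factor is assigned to the amplitudes here.

```python
import numpy as np

def split_rank1(M):
    """Split M[p, k] ~ a[p] * G[k] using the dominant singular vectors."""
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    a = U[:, 0] * s[0]      # scale factor assigned to the amplitudes (arbitrary choice)
    G = Vh[0, :]            # unit-norm gains, up to a common complex factor
    return a, G

rng = np.random.default_rng(1)
a_true = rng.normal(size=4) + 1j * rng.normal(size=4)    # 4 multipath amplitudes
G_true = rng.uniform(0.5, 2.0, size=6)                   # 6 user gain factors
M = np.outer(a_true, G_true) + 1e-3 * rng.normal(size=(4, 6))
a_est, G_est = split_rank1(M)
# np.outer(a_est, G_est) is the best rank-1 approximation of M
```

Only the product of the two vectors is determined; fixing the scale requires an extra convention, such as normalizing one user's gain to 1.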



The chip filter h[t] for reconstructing the interference signal is

[equation not legible in the source]

1 Purpose

The purpose of this memo is to document parts of the discussion we have been
having on how the TI 6414 DSP may connect to the Raceway.

2 Glossary

EMIF - A port on the DSP 6000 series peripheral bus which allows the
connection of memory devices.

SDRAM - In the context of this memo, means the main external memory of the
TI DSP - the one which contains the program and data.

3 Overview

So far, a proposed architecture is that we use the second EMIF (External
Memory Interface) of the TI 6414 DSP to connect to a dual-ported RAM.
Raceway transfers actually access the RAM, and then additional processing takes
place on the DSP to move the data to the correct place in SDRAM. In fact, if the
dual-port RAM is not large enough to buffer an entire Raceway transfer, then there
will have to be a messaging protocol between the two endpoint DSPs wishing to
exchange messages (because the message will have to be fragmented in order
not to exceed the reserved buffer space).

An additional restriction of this design is that as more Raceway endpoints are
added, the size of the dual-port RAM needs to be increased, or the maximum
fragment size needs to shrink, such that the RAM is big enough to contain at least
2*N*P buffers of size F (2*F*N*P bytes), where F is the size of the fragment, N is the number of
Raceway endpoints with which this DSP can exchange messages, P is the number
of parallel transfers which can be active on any endpoint at a time, and the
constant 2 represents double buffering so that one buffer can be transferred
to/from the Raceway while a second buffer is transferred to the DSP. The
constant becomes 4 if you want to be able to emulate a full duplex connection.
With a 4 node system, this might be 4*8K*4*4 or 512K plus a little extra for
bookkeeping information. This probably means the minimum size is 1M bytes for
the dual port device.
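
The sizing rule above is simple enough to check numerically. The helper below is a hypothetical sketch (the function name and parameters are not from the memo) that reproduces the 512K figure for the 4-node, 8K-fragment, full-duplex case.

```python
def dual_port_ram_bytes(fragment_bytes, endpoints, parallel_transfers, full_duplex=False):
    """Minimum buffer space in bytes, excluding bookkeeping overhead."""
    buffering = 4 if full_duplex else 2   # 2 = double buffering, 4 = emulated full duplex
    return buffering * fragment_bytes * endpoints * parallel_transfers

size = dual_port_ram_bytes(8 * 1024, endpoints=4, parallel_transfers=4, full_duplex=True)
print(size // 1024, "KB")   # 512 KB
```

Rounding up to the next power-of-two device size, with room for bookkeeping, gives the 1M byte figure quoted in the text.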

4 Problem Identification

There are several characteristics of this architecture which could prove
problematic:

4.1 Requirement For A Fragment/Defragment Protocol

Raceway transfers can currently be very long. This architecture would require
a protocol for breaking transfers down into fragments. If the DSP is sourcing a
transfer greater than the fragment size, then it has to either dedicate itself for the
period of the transfer to programming the DMA engine, or it has to respond to
interrupts as each fragment is transferred. In either case, there is a substantial
performance impact above and beyond the normal performance hit due to
memory bandwidth utilization.

If the DSP is on the receiving end of a Raceway transfer, a similar process has
to take place, except that there must be an interrupt to get the attention of the DSP
(polling would not be sufficient in such a case).

Beyond the performance hit such a protocol would impose on the DSP, there is
a major disadvantage in that only endpoints willing to implement this protocol can
exchange data with the DSP. It is, in effect, defining a de facto standard subset of
Raceway. This is a major interoperability issue (you can no longer plug a board of
DSPs into a fabric and have them work as a standard Raceway Adjunct
Processor).



4.2 Requirement For The DSP To Be Running Code

If the DSP is involved in the Raceway transfers, then the DSP must already be
running in order to perform Raceway transfers. This will require that all nodes on
the Raceway be self booting.

4.3 Lower Transfer Rates

Raceway is less efficient with smaller transfer sizes. If the fragment size is kept
small to minimize dual port RAM requirements, then aggregate Raceway transfer
rates will be lower because of less efficient utilization of the fabric.



4.4 It Is Different

By changing the way Raceway works, we initiate a significant departure from
the way all current Mercury systems work. While there are many other possible
architectures which will perform well, it is inherently risky to change a
fundamental model of how our multiprocessors communicate.

5 An Alternative Architecture

It may be possible to implement a different architecture which addresses some
of these shortcomings.

5.1 Architecture Description

The proposed architecture still has approximately the same hardware as the 
existing architecture. The changes are in the way that the Raceway transfers move 
between SDRAM and the Raceway. 

In the proposed architecture, the FPGA connects to both the buffering device 
(dual port RAM or FIFO) and the DSP. The connection to the buffering device 
(hereafter FIFO) is used to move Raceway data to/from the FIFO. 

The second connection is to the DSP Host Port. Dave currently believes this is 
a moderately high performance interconnect - on the order of 75 Mbytes per 
second. This interconnect could itself be used to move data to/from the DSP. The 
host port can access data in the DSP on-chip memory, as well as any of the 
peripheral devices, including the SDRAM. However, 75Mbytes per second is 
pretty slow compared to normal Raceway bandwidth, and we think we can do 
better. 

The 6414 contains a second EMIF which can be attached to the FIFO (this is
similar to what the current architecture proposal intends). The difference in this
proposed architecture is that rather than have the DSP program the DMA engine
to move data between the FIFO and the DSP/SDRAM, we propose that the FPGA
program the DMA engine directly via the Host Port.

The Host Port is a peripheral like the EMIF and the Serial Ports. The difference
is that the Host Port can master transfers into the DSP datapaths, i.e. it can read
and write any location in the DSP. Because the Host Port can access the DMA
Controller (we think), it can be used to initiate transfers via the DMA engine.

The advantage of this architecture is that Raceway transfers can be initiated
without the cooperation of the DSP. Thus, the DSP does not have to be self
booting. Performance is increased in two ways: the DSP is free to continue to
compute while Raceway transfers take place, and performance on the Raceway is
increased because there is no need to fragment messages.

The internal datapaths of the DSP are flexible enough that we can control
which devices have priority access to memory and datapath. Specifically, we can
choose to give Raceway transfers priority over the CPU, or vice versa.



5.2 Synchronization Issues

There is an issue to be solved in how we match data rates between Raceway
and the DSP. The EMIF looks to the DSP as if it were a memory, thus it is
reasonable for the DSP to assume it can get at the data it needs at any time.
However, if we indeed use a FIFO to buffer data, the implication is that there is a
way to hold off the DSP when we are waiting for the Raceway to empty or fill our
FIFO. A possibility is that the buffer device remains a dual port RAM rather than
a FIFO, and the FPGA actually does a fragment/defragment into the RAM, and
then programs the DMA engine to move that fragment into or out of the DSP. This
starts to look somewhat like the original architecture, except that because the
FPGA performs the frag/defrag, the actual transfers over the Raceway can be
arbitrarily sized (assuming we can throttle the Raceway).

Synchronization remains one of the larger problems to be solved with this
proposed architecture.



5.3 Sample Transfers

In order to illustrate how this architecture would work, two examples are
given. The first example is when the Raceway attempts to read data out of the
DSP memory.

5.3.1 Raceway Reading DSP Memory

In this example, we assume that another DSP is trying to read the SDRAM of
the local DSP.

1) The FPGA detects a Raceway packet arriving, and decodes that it is a read
of address 0x10000 (for instance).

2) The FPGA writes over the Host Port Interface in order to program the
DMA engine. It programs the DMA engine to transfer data starting at
location 0x10000 (a location in the primary EMIF corresponding to a
location in SDRAM) to a location in the secondary EMIF (the buffer
device/FIFO). As data arrives in the buffer device, the FPGA reads the
data out of the buffer device, and moves it onto the Raceway. When the
proper number of bytes have been moved, the DMA engine finishes the
transfer, and the FPGA finishes moving data from the FIFO to the
Raceway.

5.3.2 Raceway Writing DSP Memory

In this example, we assume that another DSP is trying to write to the SDRAM
of the local DSP.

1) The FPGA detects a Raceway packet arriving, and decodes that it is a
write of location 0x20000 (for instance).

2) The FPGA fills some amount of the buffer device with data from the
Raceway, and then:

3) The FPGA writes over the Host Port Interface in order to program the
DMA engine. It programs the DMA engine to transfer data from the buffer
device (secondary EMIF) and to write it to the primary EMIF at address
0x20000.

4) At the end of the transfer, we could either interrupt the DSP to signal that
a Raceway packet has arrived, or we can use the standard Mercury method
of polling a location in the SMB to see whether the transfer has completed
yet.
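
The two sample transfers can be mimicked with a toy software model. Everything in this sketch is hypothetical: the class and function names are invented for illustration, the "DMA engine" is a plain memory copy, and the FPGA is assumed to stage data through the buffer device in buffer-sized pieces as discussed in section 5.2. No real TI register layout or HPI protocol is modeled.

```python
class DspModel:
    """Toy model: SDRAM on the primary EMIF, a buffer device on the secondary EMIF."""

    def __init__(self, sdram_size, buffer_size):
        self.sdram = bytearray(sdram_size)
        self.buffer = bytearray(buffer_size)

    def hpi_program_dma(self, src, src_off, dst, dst_off, count):
        """Stand-in for the FPGA writing DMA registers through the Host Port."""
        dst[dst_off:dst_off + count] = src[src_off:src_off + count]

def raceway_write(dsp, address, payload):
    """FPGA side of a Raceway write: fill the buffer device, then trigger the DMA."""
    step = len(dsp.buffer)
    for start in range(0, len(payload), step):
        chunk = payload[start:start + step]
        dsp.buffer[:len(chunk)] = chunk
        dsp.hpi_program_dma(dsp.buffer, 0, dsp.sdram, address + start, len(chunk))

def raceway_read(dsp, address, count):
    """FPGA side of a Raceway read: DMA into the buffer device, then drain it."""
    out = bytearray()
    step = len(dsp.buffer)
    for start in range(0, count, step):
        n = min(step, count - start)
        dsp.hpi_program_dma(dsp.sdram, address + start, dsp.buffer, 0, n)
        out += dsp.buffer[:n]
    return bytes(out)

dsp = DspModel(sdram_size=0x30000, buffer_size=4096)
data = bytes(range(256)) * 40                  # 10240 bytes, larger than the buffer
raceway_write(dsp, 0x20000, data)
readback = raceway_read(dsp, 0x20000, len(data))
```

The DSP core never appears in either function, which is the point of the proposal: the transfer proceeds without the DSP running any code.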

5.4 Additional Thoughts

1) We need to verify that the Host Port Interface can program the DMA
engine. The documentation on the 6201 clearly states that it can write to
any location in internal memory, and to anywhere on the peripheral bus;
however, the DMA engine/controller is the datapath controller for all that,
so it is always possible that there is a special case which does not allow
writing of the DMA engine/controller registers from HPI. The chance of
this being so is quite remote, but needs to be verified.

2) We need to understand the transfer rates and latencies of the HPI. This
architecture relies on fairly low latency access through the HPI, otherwise
more buffering space would be required, and at some point bandwidth
begins to be affected.

3) We need to understand the limitations of Raceway with respect to
throttling, etc. The best case would be that Raceway can provide data as
fast as the EMIF can take it (so we wouldn't worry about having data
ready when EMIF wanted it), and also for Raceway to be able to be
throttled so that it can take the data at the rate the EMIF can provide it.
The more the reality deviates from this best case scenario, the more extra
logic is required in the FPGA until at some point complexity may prevent
the architecture from being viable.

4) What we currently know about the 6414 is actually educated guesses
based on documentation of earlier DSPs. We are making some
assumptions about how TI will have enhanced their chip.

5) If/when TI ever puts a RapidIO interface on their DSPs, it will almost
certainly look like a high speed HPI, i.e. it will sit on the peripheral bus,
have a separate datapath channel, data coming in will simply flow to the
correct addresses, and outgoing data transfers will happen by
programming the DMA engine to send data to the RapidIO peripheral
address. This proposed architecture looks almost exactly like that, and so
probably will not require major changes to use a RapidIO enhanced DSP.

6) There are probably more thoughts, but this is probably a good start.
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6201 Design Options 
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Option 1 is the original proposal submitted at the DSP meeting Monday. Option 2 was created during the 
meeting. 

The main shortfall in Option 1 is the sharing of the EMIF bus between the 6201 and the Raceway
DMA FPGA. During DMA operations over the Raceway, the 6201 will not have access to the EMIF
interface. Any data or instruction fetches from SDRAM will stall. Given the relatively small size of the
internal SRAM, this will impose a significant penalty on the operation of the 6201. Option 1 also requires
the FPGA to take over SDRAM refresh operation when it takes control of the EMIF bus. This passing back
and forth of the refresh task will not be clean.

Option 2 places a bi-directional transceiver between the 6201's EMIF bus and the Raceway
SDRAM. This allows the 6201 to process data and fetch instructions from its
local SDRAM without any interruption while the DMA FPGA is accessing the Raceway SDRAM. The HPI interface is used by the
6201 to program the DMA engine and by the DMA engine to indicate the DMA complete status to the
FPGA. Option 2 also lends itself to a dual 6201 node per Raceway interface. Decode logic controlling
access to the Raceway SDRAM can be designed in a number/combination of ways:

Total access to both 6201s

Separate areas for each 6201

Read but no write to the other 6201's memory space

A separate common area accessible to both for message passing

The ability of one 6201 to go through the transceiver to the other's local SDRAM (not recommended)

For a migration story to the 6414, Option 2 is a better sell. Option 3 shows the 6414 design: the transceiver is
stripped off and the Raceway SDRAM is connected to the second EMIF. The design will go to one DSP per
Raceway interface due to the increased processing power of the 6414.

[Figure: Option 3 (block diagram, labeled RACEWAY)]

1. Introduction

Typical processing:

Signal is sampled at N samples per chip
Despread by
    upsampling the chipping sequence by interpolating, using the RRC chip pulse matched filter as an interpolation filter
    multiplying the digitized receive signal by the upsampled and interpolated chip sequence
Accumulate (integrate) results for an entire DPCCH symbol
Repeat at the early (lead) and late (lag) sample offset values to calculate delay locked loop variables
Sweep the code correlator over N*256 lags to determine code synchronization and channel response

Spreading sequence is 256 chips long
Typical filter is 12 chips long
Typical oversampling rate on the receiver is N = 8

Key calculations

Interpolation of the spreading code - precomputed and stored
Correlation process: N*256 CMAC
Correlation repeated N*256 + 2 (DLL) times
Total CMACs: N*256 * (N*256 + 2) = N^2*65536 + 512*N
For N = 8, this results in: 4,198,400 CMAC
1 CMAC = 4 RMUL + 2 RADD = 6 ROP
Results in 25,190,400 real operations
At 15000 Hz symbol rate, need: 378 GOP/s
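
The operation count above can be reproduced with a few lines of arithmetic. The constant names are illustrative; the 6 real operations per CMAC follows the memo's own convention.

```python
N = 8                      # receiver oversampling factor
CHIPS = 256                # spreading sequence length
SYMBOL_RATE_HZ = 15_000
ROP_PER_CMAC = 6           # the memo's convention: 4 RMUL + 2 RADD

lags = N * CHIPS + 2       # swept lags plus the two DLL offsets
cmacs = N * CHIPS * lags
real_ops = cmacs * ROP_PER_CMAC
print(cmacs, real_ops, round(real_ops * SYMBOL_RATE_HZ / 1e9))   # 4198400 25190400 378
```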
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2. A New Design 

Use of FFT to perform efficient circular convolution of the spreading code sequence
Results in
    Short code synchronization (chip sync only, not slot or frame)
    DPCCH demodulation
    Early and late Delay Locked Loop variables
    Rough channel estimate values for an entire symbol worth of differential delay
Polyphase signal processing
    Digitize the signal at an Nx oversample rate, filter with the RRC filter and split into N streams at
    the 1x rate.
    Compute the complex conjugate of the FT of the spreading code sequence at the chip rate -
    precomputed and stored

Computation:

Filter data at Nx oversample rate and split into N streams at 1x rate
For each stream,
    Compute 256-point FFT
    Complex multiply FFT with stored FFT values of spreading code
    Inverse 256-point FFT
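
The per-stream computation (FFT, multiply by the stored conjugate code spectrum, inverse FFT) yields the correlation at all 256 chip lags at once. The sketch below demonstrates this on one chip-rate stream using NumPy. The random code, noise level and true lag are made-up test values, and the RRC filtering and polyphase split are omitted.

```python
import numpy as np

def correlate_all_lags(stream, code):
    """Circular cross-correlation of one chip-rate stream with the code, all lags at once."""
    code_fft_conj = np.conj(np.fft.fft(code))    # precomputed and stored in practice
    return np.fft.ifft(np.fft.fft(stream) * code_fft_conj)

rng = np.random.default_rng(2)
code = rng.choice([-1.0, 1.0], size=256)          # stand-in for the spreading sequence
true_lag = 37
stream = np.roll(code, true_lag) + 0.5 * rng.normal(size=256)
corr = correlate_all_lags(stream, code)
print(int(np.argmax(np.abs(corr))))               # 37
```

Since every lag is available, the early and late values used by a delay locked loop are simply the neighbors of the peak across the N polyphase streams.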

Ops calculation:

Input filter: could be done using FFT as well,
but for time domain processing: 8*256 points, filter length 96 =>
96 RMUL per point, 95 RADD per point.
Total of 196,608 RMUL, 194,560 RADD per symbol ==> 391,168 ROP per symbol
I and Q streams => 782,336 ROP

Stream processing (8 streams)
Radix 4 FFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP
256 CMUL = 1,536 ROP
Radix 4 IFFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP
TOTAL per stream: 71,168 ROP
Total stream calcs: 569,344 ROP

Total ops per second at 15000 Hz symbol rate is: 20.3 GOPS,
more than 18 times more efficient than the traditional approach.

Also, the DLL circuitry can be eliminated since the entire channel response is calculated at the
symbol rate.

The FFT numbers may overstate the number of complex multiplications needed by a factor of 2.
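
The totals for the FFT-based design can be checked the same way. The per-FFT figure of 34,816 ROP is taken from the memo as given rather than derived (it equals 1024 units of 34 ROP each, which would correspond to 3 CMUL + 8 CADD at 6 and 2 ROP respectively, so the "4 CMUL" in the text may be a slip). All constant names are illustrative.

```python
N, CHIPS, TAPS = 8, 256, 96
SYMBOL_RATE_HZ = 15_000

points = N * CHIPS
filter_rop = 2 * (points * TAPS + points * (TAPS - 1))   # RMUL + RADD, for I and Q

FFT_ROP = 34_816           # per 256-point radix-4 FFT, figure taken from the memo
CMUL_ROP = 6               # 4 RMUL + 2 RADD
per_stream = FFT_ROP + CHIPS * CMUL_ROP + FFT_ROP
stream_rop = N * per_stream

total_gops = (filter_rop + stream_rop) * SYMBOL_RATE_HZ / 1e9
traditional_gops = 25_190_400 * SYMBOL_RATE_HZ / 1e9
print(filter_rop, per_stream, stream_rop)                              # 782336 71168 569344
print(round(total_gops, 1), round(traditional_gops / total_gops, 1))   # 20.3 18.6
```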
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Introduction 



Multi-User Detection (MUD) has been shown to provide a number of significant benefits [1][2]. These
include increased system capacity, increased range, enhanced Quality of Service (QoS), improved near-far
resistance, extended battery life, and reduced handset transmit power. This paper describes the practical
implementation of MUD for the UMTS uplink using short codes. The focus is on
practical implementation details such as efficient implementation of the calculations, processing
requirements, latencies, MUD efficiency, and mapping to hardware.

The use of short codes allows MUD to be performed at the symbol rate. As such, MUD can be introduced
into a conventional Base-Transceiver-Station (BTS) as an enhancement to the Matched-Filter (MF) RAKE
receiver. The MUD processing takes the MF detection statistics, performs interference cancellation, and
then delivers improved hard or soft-decision symbol estimates to the symbol-rate BTS processing
functions. The MUD processing introduces only a few milliseconds of latency. Because of the reduced
computational complexity of MUD operating at the symbol rate, the entire MUD functionality can be
implemented in software on a single card or daughter card populated with a minimal number of processors.
We present here an implementation of an iterative hard-decision Interference Cancellation (IC) algorithm
on four PowerPC 7410 processors. The processors are connected together with a high-bandwidth RACE++
interconnect fabric.



In order to perform MUD at the symbol rate, the correlation between the user channel-corrupted signature
waveforms must be calculated. These correlations are stored as elements of matrices, here referred to as the
R-matrices. Since the channel is continually changing, these correlations must be updated in real time.
There are two elements to updating the R-matrices. The first part is based on the user code correlations.
These depend on the relative lag between the various user multipath components. It is assumed that these
lags change with a time constant of about 400 ms. The second part is due to the fast variation of the
Rayleigh-fading multipath amplitudes. It is assumed that these amplitudes are changing with a time
constant of about 1.33 ms. The R-matrices are used to cancel the multiple access interference through the
Multi-stage Decision-Feedback Interference Cancellation (MDFIC) technique.



UMTS Uplink Multi-rate Signal Model and RAKE Processing 

We derive here the equations describing the MF outputs based on the WCDMA transmitted waveform.
The users accessing the system will hereafter be referred to as physical users. Each physical user is
regarded as a composition of virtual users. Each virtual user transmits a single bit per symbol period, where
by symbol period we mean a time duration of 256 chips (i.e. 1/15 ms). The number of virtual users
for a given physical user is then equal to the number of bits transmitted in a symbol period. At a minimum each
active physical user is composed of two virtual users, one for the Dedicated Physical Control Channel
(DPCCH) [3] and one for the Dedicated Physical Data Channel (DPDCH). If the physical user is a data
user with Spreading Factor (SF) less than 256 then there are J = 256/SF data bits and one control bit
transmitted per symbol period. Hence for the rth physical user with data-channel spreading factor $SF_r$, there
are a total of $1 + 256/SF_r$ virtual users. The total number of virtual users is denoted $K_v$.

The transmitted waveform for the rth physical user can be written as

$$x_r[t] = \sum_{k=1}^{1+J_r} \beta_k \sum_m s_k[t - mT]\, b_k[m], \qquad s_k[t] \equiv \sum_{p=0}^{N-1} h[t - pN_c]\, c_k[p] \tag{2}$$

where t is the integer time sample index, $T = NN_c$ is the data bit duration, N = 256 is the short-code length,
$N_c$ is the number of samples per chip, and where $\beta_k = \beta_c$ if the kth virtual user is a control channel and
$\beta_k = \beta_d$ if the kth virtual user is a data channel. The multipliers $\beta_c$ and $\beta_d$ are constants used to select the relative
amplitudes of the control and data channels. At least one of these constants must be equal to 1 for any given
symbol period m. The waveform $s_k[t]$ is referred to as the transmitted signature waveform for the kth
virtual user. This waveform is generated by passing the spread code sequence $c_k[n]$ through a root-raised-cosine
pulse shaping filter h[t]. If the kth virtual user corresponds to a data user with spreading factor less
than 256 then the code $c_k[n]$ still has length 256, but only $N_k$ of the 256 elements are non-zero, where $N_k$ is
the spreading factor for the kth virtual user. The non-zero values are extracted from the code
$C_{ch,256,64}[n]\, S_{sh}[n]$ [3]. The W-CDMA standard actually allows for up to six DPDCHs to be multiplexed with a single
DPCCH. This functionality is not presently incorporated in the MUD algorithms described below.



The baseband received signal can be written

$$r[t] = \sum_{k=1}^{K_v} \sum_m \tilde{s}_k[t - mT]\, b_k[m] + w[t], \qquad \tilde{s}_k[t] \equiv \sum_{q=1}^{L} a_{kq}\, s_k[t - \tau_{kq}] \tag{3}$$

where w[t] is receiver noise, $\tilde{s}_k[t]$ is the channel-corrupted signature waveform for virtual user k, L is the
number of multipath components, and $a_{kq}$ are the complex multipath amplitudes. The amplitude ratios $\beta_k$
are incorporated into the amplitudes $a_{kq}$. Notice that if k and l are two virtual users corresponding to the
same physical user then, aside from scaling by $\beta_k$ and $\beta_l$, $a_{kq}$ and $a_{lq}$ are equal. This is due to the fact
that the signal waveforms of all virtual users corresponding to the same physical user pass through the same
channel. The waveform $s_k[t]$ is now the received signature waveform for the kth virtual user. This
waveform is identical to the transmitted signature waveform given in Equation (2) except that the
root-raised-cosine pulse h[t] is replaced with the raised-cosine pulse g[t].

Thus far the received signal has been match-filtered to the chip pulse. It must next be match-filtered by the
user code-sequence filter. The resulting detection statistic is denoted here as $y_l$, the matched-filter output
for the lth virtual user. Since there are $K_v$ codes, there are $K_v$ such detection statistics, which are collected
into a column vector y[m] for the mth symbol period. The matched-filter output $y_l[m]$ for the lth virtual
user can be written

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_n r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^{*}[n] \right\} \tag{4}$$

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, $N_l$ is the (non-zero) length of code $c_l[n]$, and
$\eta_l[m]$ is the match-filtered receiver noise. Substituting r[t] from Equation (3) above gives

$$y_l[m] = \sum_{m'} \sum_{k=1}^{K_v} r_{lk}[m']\, b_k[m - m'] + \eta_l[m]$$

$$r_{lk}[m'] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{q'=1}^{L} \hat{a}_{lq}^{*}\, a_{kq'} \cdot \frac{1}{2N_l} \sum_n \sum_p g[(n-p)N_c + m'T + \hat{\tau}_{lq} - \tau_{kq'}] \cdot c_l^{*}[n]\, c_k[p] \right\} \tag{5}$$

The terms for $m' \neq 0$ result from asynchronous users.

MUD Algorithm and Functions 

A vast number of MUD algorithms have been proposed [1][2]. Many of these are too computationally
complex to be implemented with current technology. The linear-iterative class of MUD algorithms
[4][5][6] is the least computationally complex. For this class of algorithms software implementation is
feasible. The hard-decision variants of these algorithms also enjoy a significant performance advantage in
that they do not tend to amplify other-cell interference. The downside is that performance degrades under
high input BER. Since channel decoding reduces the BER by orders of magnitude, it is possible to be
operating with raw channel BERs as high as 10%. A number of methods have been proposed to address this
issue, including the null-zone detector [4], and partial interference cancellation [4][5][6]. We employ
partial interference cancellation in conjunction with a new thresholding technique which reduces
computational complexity. Our method provides excellent performance under high input BER.

The implementation of MUD at the symbol rate can be divided into two functions. The first function is the
calculation of the R-matrix elements. The second function is interference cancellation, which relies on
knowledge of the R-matrix elements. The calculation of these elements and the computational complexity
are described in the following section. Computational complexity is expressed in Giga Operations Per
Second (GOPS). The subsequent section describes the MUD IC function. The method of interference
cancellation employed is Multistage Decision Feedback IC (MDFIC) [2][7].



R-matrix 

From Equation (5) above, the R-matrix calculations can be divided into three separate calculations, each
with an associated time constant for real-time operation, as follows

$$r_{lk}[m'] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{q'=1}^{L} a_{lq}^{*}\, a_{kq'}\, C_{lkqq'}[m'] \right\}$$

$$C_{lkqq'}[m'] \equiv \sum_m g[mN_c + m'T + \tau_{lq} - \tau_{kq'}]\, \Gamma_{lk}[m]$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_n c_l^{*}[n]\, c_k[n - m] \tag{6}$$

where we have omitted the hats indicating parameter estimates. Hence we must calculate the R-matrices,
which depend on the C-matrices ($C_{lkqq'}[m']$), which depend on the Γ-matrix ($\Gamma_{lk}[m]$). The Γ-matrix has the
slowest time constant. This matrix represents the user code correlations for all values of offset m. For the
case of 100 voice users the total memory requirement is 21 MB based on two bytes (real and imaginary
parts) per element. This matrix is updated only when new codes (new users) are added to the system. Hence
this is essentially a static matrix. The computational requirements are negligible. The most efficient method
of calculation depends on the non-zero length of the codes. For high data-rate users the non-zero length of
the codes is only 4 chips. For these codes a direct convolution is the most efficient method to
calculate the elements. For low data-rate users it is more efficient to calculate the elements using the
FFT to perform the convolutions in the frequency domain.

The C-matrix is calculated from the Γ-matrix. These elements must be calculated whenever a user's delay
lag changes. For now assume that on average each multipath component changes every 400 ms. The length
of the g[] function is 48 samples. Since we are oversampling by 4, there are 12 multiply-accumulations
(real x complex) to be performed per element, or 48 operations per element. When there are 100 low-rate
users on the system (200 virtual users) and a single multipath lag (of 4) changes for one user, a total of
$(1.5)(2)K_v L N_v$ elements must be calculated. The factor of 1.5 comes from the 3 C-matrices (m' = -1, 0, 1),
reduced by a factor of 2 due to a conjugate symmetry condition. The factor of 2 results because both rows
and columns must be updated. The factor $N_v$ is the number of virtual users per physical user, which for the
lowest rate users is $N_v = 2$. In total then this amounts to 230,400 operations per multipath component per
physical user. Assuming 100 physical users with 4 multipath components per user, each changing once per
400 ms, gives 230 MOPS.
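
The C-matrix update load quoted above follows from a short calculation. In this sketch the element count is taken as 1.5 x 2 x K_v x L x N_v, which is the reading that reproduces the 230,400 figure in the text. The variable names are illustrative.

```python
K_V = 200              # virtual users (100 low-rate physical users)
L = 4                  # multipath components per user
N_V = 2                # virtual users per physical user
OPS_PER_ELEMENT = 48   # 12 real-by-complex multiply-accumulates at 4 ops each

elements = 3 * K_V * L * N_V                 # (1.5)(2) K_v L N_v
ops_per_change = elements * OPS_PER_ELEMENT
changes_per_second = 100 * L * 1000 // 400   # every component changes once per 400 ms
mops = ops_per_change * changes_per_second / 1e6
print(elements, ops_per_change, round(mops, 1))   # 4800 230400 230.4
```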

The R-matrices are calculated firom the C-matrices. From Equation (6) above the R-matrix elements are 



where ajt are L jc 7 vectors, and CiJm'] are L jc L matrices. The rate at which these calculations must be 
performed depends on the velocity of the users. The selected update rate is 1.33 ms. If the update rate is too
slow such that the estimated R-matrix values deviate significantly from the actual R-matrix values then
there is a degradation in the MUD efficiency. Figure 1 below shows the degradation in MUD efficiency
versus user velocity for an update rate of 1.33 ms, which corresponds to two WCDMA time slots. The plot
indicates that there is high MUD efficiency for users with velocity less than about 100 km/hr. The plot
also indicates that the interference corresponding to fast users is not cancelled as effectively as the interference
due to slow users. For a system with a mix of fast and slow users the resulting MUD efficiency is an average
of the MUD efficiency for the various user velocities.
Figure 1. MUD efficiency versus user velocity in km/hr

From Equation (7) the R-matrix elements can be calculated in terms of an X-matrix
which represents amplitude-amplitude multiplies

$$ r_{lk}[m'] = \mathrm{Re}\{\mathrm{tr}[a_l^H C_{lk}[m'] \, a_k]\} = \mathrm{Re}\{\mathrm{tr}[C_{lk}[m'] \, a_k a_l^H]\} \equiv \mathrm{Re}\{\mathrm{tr}[C_{lk}[m'] \, X_{lk}]\} $$
$$ C_{lk}[m'] = C_{lk}^{R}[m'] + j\, C_{lk}^{I}[m'] $$

The advantage of this approach is that the X-matrix multiplies can be reused for all virtual users associated
with a physical user and for all m' (i.e. m' = 0, 1). Hence these calculations are negligible when amortized.
The remaining calculations can be expressed as a single real dot product of length 2L^2 = 32. The
calculations are performed in 16-bit fixed-point math. The total number of operations is thus
1.5(4)(K_v L)^2 = 3.84 Mops. The processing requirement is then 2.90 GOPS. The X-matrix multiplies when
amortized amount to an additional 0.7 GOPS. The total processing requirement is then 3.60 GOPS.
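As a quick check of these figures, the sketch below recomputes the dot-product load from K_v = 200, L = 4 and the 1.33 ms update period. The 0.7 GOPS X-matrix figure is taken from the text as quoted and is not derived here; the results agree with the quoted values to within rounding.

```python
# R-matrix load for the short-code case, using the figures quoted above.
K_v, L = 200, 4
update_period_s = 1.33e-3

dot_length = 2 * L**2                        # real dot product of length 2L^2
ops_per_update = 1.5 * 4 * (K_v * L) ** 2    # 1.5(4)(K_v L)^2
dot_gops = ops_per_update / update_period_s / 1e9

x_matrix_gops = 0.7          # amortized figure quoted in the text, not derived here
total_gops = dot_gops + x_matrix_gops

print(dot_length, ops_per_update)
print(f"dot products: {dot_gops:.2f} GOPS, total: {total_gops:.2f} GOPS")
```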



MDFIC 

From Equation (5) above the matched-filter outputs are given by

$$ y_l[m] = r_{ll}[0]\, b_l[m] + \sum_{k=1}^{K_v} r_{lk}[-1]\, b_k[m+1] + \sum_{k=1}^{K_v} \big[ r_{lk}[0] - r_{ll}[0]\,\delta_{lk} \big] b_k[m] + \sum_{k=1}^{K_v} r_{lk}[1]\, b_k[m-1] + \eta_l[m] \qquad (9) $$

The first term represents the signal of interest. All the remaining terms represent Multiple Access
Interference (MAI) and noise. The MDFIC algorithm iteratively solves for the symbol estimates \hat{b}_l[m]
using

$$ \hat{b}_l[m] = \mathrm{sign}\Big\{ y_l[m] - \sum_{k=1}^{K_v} r_{lk}[-1]\, \hat{b}_k[m+1] - \sum_{k=1}^{K_v} \big[ r_{lk}[0] - r_{ll}[0]\,\delta_{lk} \big] \hat{b}_k[m] - \sum_{k=1}^{K_v} r_{lk}[1]\, \hat{b}_k[m-1] \Big\} \qquad (10) $$

with initial estimates given by hard decisions on the matched-filter detection statistics, \hat{b}_l[m] = sign{y_l[m]}.
The MDFIC [7] technique is closely related to the SIC and PIC techniques. Notice that new estimates
are immediately introduced back into the interference cancellation as they are calculated. Hence at any
given cancellation step the best available symbol estimates are used. This idea is analogous to the
Gauss-Seidel method for solving diagonally dominant linear systems.

The above iteration is performed on a block of 20 symbols, for all users. The 20-symbol block size
represents two WCDMA time slots. The R-matrices are assumed to be constant over this period.
Performance is improved under high input BER if the sign detector in Equation (10) is replaced by the
hyperbolic tangent detector [6]. This detector has a single slope parameter which is variable from iteration
to iteration.
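The iteration described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used on the card: the function names are hypothetical, the R-matrices are passed as nested lists, symbols outside the block are treated as zero, and the hard sign detector is used rather than the hyperbolic tangent detector. The key point it shows is the Gauss-Seidel style reuse of each new estimate as soon as it is computed.

```python
def sign(v):
    """Hard decision; ties are mapped to +1 (an arbitrary choice)."""
    return 1 if v >= 0 else -1


def mdfic(y, r_m1, r_0, r_p1, iterations=2):
    """Sketch of the MDFIC iteration for one block of symbols.

    y[m][l]      matched-filter statistic of virtual user l at symbol m
    r_m1[l][k]   R[-1];  r_0[l][k]  R[0];  r_p1[l][k]  R[1]
    Symbols outside the block are treated as zero (an assumption).
    """
    n_sym, k_v = len(y), len(y[0])
    # Initial estimates: hard decisions on the matched-filter statistics.
    b = [[sign(y[m][l]) for l in range(k_v)] for m in range(n_sym)]
    for _ in range(iterations):
        for m in range(n_sym):
            for l in range(k_v):
                mai = 0.0
                for k in range(k_v):
                    if m + 1 < n_sym:
                        mai += r_m1[l][k] * b[m + 1][k]
                    if k != l:                      # skip the signal of interest
                        mai += r_0[l][k] * b[m][k]
                    if m > 0:
                        mai += r_p1[l][k] * b[m - 1][k]
                b[m][l] = sign(y[m][l] - mai)       # new estimate is reused immediately
    return b


# Toy example: user 1 is strong enough to flip the matched-filter decision of user 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
r0 = [[1.0, 1.5], [1.5, 4.0]]
y = [[-0.5, -2.5]]            # noiseless statistics for transmitted bits (+1, -1)
print(mdfic(y, zeros, r0, zeros))
```

In the toy example the matched filter alone would decide -1 for user 0; after cancellation of the strong interferer the decision becomes +1.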

The three R-matrices (R[-1], R[0] and R[1]) are each K_v x K_v in size. The total number of operations then is
6K_v^2 per iteration. The computational complexity of the MDFIC algorithm depends on the total number of
virtual users, which depends on the mix of users at the various spreading factors. For K_v = 200 users (e.g.
100 low-rate users) this amounts to 240,000 operations. In the current implementation two iterations are
used, requiring a total of 480,000 operations. For real-time operation these operations must be performed in
1/15 ms. The total processing requirement is then 7.2 GOPS. Computational complexity is markedly
reduced if a threshold parameter is set such that IC is performed only for values |y_l[m]| below the threshold.
The idea is that if |y_l[m]| is large there is little doubt as to the sign of b_l[m], and IC need not be performed.
The value of the threshold parameter is variable from stage to stage.
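The operation count above follows directly from the quoted parameters, as the sketch below shows. The helper `thresholded_gops` is hypothetical: it assumes the load scales linearly with the fraction of detection statistics that fall below the threshold, which the text does not state.

```python
# MDFIC processing load from the figures quoted above.
K_v = 200
iterations = 2
symbols_per_second = 15_000          # one symbol every 1/15 ms

ops_per_iteration = 6 * K_v**2       # three K_v x K_v matrices, 2 ops per MAC
ops_per_symbol = iterations * ops_per_iteration
gops = ops_per_symbol * symbols_per_second / 1e9


def thresholded_gops(active_fraction):
    """Load if IC runs for only a fraction of the statistics (linear scaling assumed)."""
    return gops * active_fraction


print(ops_per_iteration, ops_per_symbol, gops)
```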

Mapping to Hardware

The above calculations are performed on a single 9"x6" card populated with four Power PC 7410
processors. These processors employ the AltiVec SIMD vector arithmetic-logic unit, which has 32 128-bit
vector registers. These registers can hold either 4 32-bit floats, 4 32-bit ints, 8 16-bit shorts, or 16 8-bit
chars. Two vector SIMD operations (multiply and accumulate) can be performed per clock. The clock rate
used for the current implementation is 400 MHz. The processors, however, can be operated at 500 MHz,
with higher clock speeds expected in the near future. Each processor has 32 KB of L1 cache and 2 MB of
266 MHz L2 cache. The maximum theoretical performance of these processors is thus 3.2 GFLOPS, 6.4 GOPS
(16-bit), or 12.8 GOPS (8-bit). The current implementation uses a combination of floating-point, 16-bit
fixed-point and 8-bit fixed-point calculations.

The four PPC7410 processors are interconnected with a RACE++ 266 MB/s 8-port switched fabric as
shown in Figure 2. The high-bandwidth fabric allows transfer of large amounts of data with very low
latency so as to achieve efficient parallelism of the four processors. The maximum theoretical performance
of the card is thus 51.2 GOPS.

Figure 2. Partitioning of MUD functions across four processors

As shown in Figure 2 the MDFIC and C-matrix calculations are allocated to a single processor. The other
three processors are given to the R-matrix calculations, which are considerably more complex.
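The peak-throughput figures quoted above follow from the clock rate, the two vector operations per clock, and the number of elements held in a 128-bit register. A short sketch of that arithmetic, with illustrative variable names:

```python
# Peak AltiVec throughput at 400 MHz, per processor and per four-processor card.
clock_hz = 400e6
vector_ops_per_clock = 2                          # multiply and accumulate
lanes = {"float32": 4, "int16": 8, "int8": 16}    # elements per 128-bit register

peak_gops = {name: clock_hz * vector_ops_per_clock * n / 1e9 for name, n in lanes.items()}
card_peak_gops = 4 * peak_gops["int8"]            # four processors, 8-bit arithmetic

print(peak_gops)
print(card_peak_gops)
```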

MUD BER Performance 



A sample of the Bit Error Rate (BER) performance of the MUD algorithm is shown in Figure 3. For
comparison the matched-filter BER is also shown. The figure shows that MUD doubles system capacity.

Figure 3. Log10 bit error rate versus system capacity for matched
filter (blue) and multiuser detection (red)



The above performance is based on the following assumptions: 

• A single receive antenna is used 

• The target BER is 0.001 
• The percentage of system users in handoff is 30%

• Other-cell interference is 35% of intra-cell interference. This is lower than the typical value (0.60)
used. The reason is that the other-cell users in handoff with the cell of interest are included in the
intra-cell interference. This is because the cell of interest is processing these users and hence can cancel
their interference using MUD.

• A 4-tap multipath channel is used. Each tap is Rayleigh fading. The composite power of all paths is 
perfectly power controlled. 

• The channel amplitude estimation error is 10% 

• The channel delay estimation is Va chip 

• The activity factor for voice is 0.40 

• The relative amplitude of the control channel is β_c = 0.5333



Conclusions 

The current state of processor technology is such that iterative hard-decision MUD for the UMTS uplink
can be implemented in software on a single card or daughter card populated with four Power PC 7410
processors, connected together with a high-bandwidth RACE++ interconnect fabric. The use of short codes
allows MUD to be performed at the symbol rate. The advantage of symbol-rate processing is that MUD can
be introduced into a BTS as an enhancement to the conventional RAKE receiver. The MUD processing
takes the MF detection statistics, performs interference cancellation, and then delivers improved hard or
soft-decision symbol estimates to the symbol-rate BTS processing functions. The latency introduced is only
a few milliseconds. In order to perform MUD at the symbol rate the R-matrices must be updated in real
time. There is a minimal degradation in MUD efficiency if these elements are updated at a rate of once per
1.33 ms. The R-matrices are used to cancel the multiple access interference through the MDFIC
interference cancellation technique. At a BER of 0.001 the use of the above MUD technique doubles
system capacity.
1. Introduction



This report briefly describes long-code Multi-User Detection (MUD). Section 2 describes
the long-code signal model, which is different from the short-code model. Section 3
describes the matched-filtering operation for long codes and gives a lower bound on the
GOPS required for long-code symbol-rate MUD. The lower bound is 19.7 TOPS (i.e. Tera
Operations Per Second; 1 TOPS = 1000 GOPS). Because of the extreme computational
complexity of symbol-rate MUD for long codes, regenerative MUD is examined. It is shown
in Section 4 that although regenerative MUD operates at the chip rate, the overall
complexity is lower for long codes. Two methods are examined. The first method is a
somewhat straightforward implementation of regenerative MUD. The required
computational complexity is shown to be 774.6 GOPS for 100 users. The second method
is based on combining impulse trains and subsequently raised-cosine filtering the
composite signal. The total computational complexity is shown to be 109.6 GOPS for 100
users. Regenerative MUD is linear in the number of users, so that if the number of users
is reduced to 64 the complexity drops to 70.1 GOPS. The complexity is also linear in the
number of multipaths subtracted, so that if the number of multipaths subtracted is reduced
from 4 to 2 the complexity drops to 35.1 GOPS. It may be desirable for MUD performance
to subtract only the two largest multipaths due to channel amplitude estimation errors. The
above complexity figures are for a single interference cancellation stage. For two stages
the computation is doubled. To perform regenerative MUD the baseband antenna stream
data must be brought onto the MUD board. The required bandwidth is 123 MB/s. Note that
the figures given above can perhaps be reduced through a clever implementation. A block
diagram of regenerative MUD is shown to facilitate an investigation into the feasibility of
an FPGA or ASIC implementation.
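The scaling statements above can be captured in a small helper. This is a sketch that simply applies the linearity the report asserts (in users, in multipaths subtracted, and in cancellation stages) to the quoted 109.6 GOPS baseline; the function name and its defaults are illustrative.

```python
def scaled_gops(base_gops=109.6, users=100, multipaths=4, stages=1):
    """Scale the second-method figure linearly, as stated in the text."""
    return base_gops * (users / 100) * (multipaths / 4) * stages


print(round(scaled_gops(users=64), 1))                 # 64 users, 4 multipaths
print(round(scaled_gops(users=64, multipaths=2), 1))   # 64 users, 2 multipaths
print(round(scaled_gops(stages=2), 1))                 # two cancellation stages
```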



2. Signal Model 
The received signal model for short-code WCDMA is given in [1]. When long codes are
used the signal model is different since effectively the codes change from symbol to
symbol. We present here the WCDMA signal model for long codes. The baseband
received signal can be written

$$ r[t] = \sum_{k=1}^{2K} \sum_{m} \tilde{s}_{km}[t - mT_k]\, b_k[m] + w[t] \qquad (1) $$

where t is the integer time sample index, T_k = N_k N_c is the data bit duration, which depends
on the user spreading factor, N_k is the spreading factor for the kth virtual user, N_c is the
number of samples per chip, K is the total number of physical users, w[t] is receiver noise,
and where \tilde{s}_{km}[t] is the channel-corrupted signature waveform for the kth virtual user over
the mth symbol period. The concept of virtual users is used to account for both the
DPDCH and the DPCCH. Hence if there are K physical users, then there are K_v = 2K
virtual users. The user signature waveform and hence the channel-corrupted signature
waveform vary from symbol period to symbol period since long codes by definition extend
over many symbol periods. For L multipath components the channel-corrupted signature
waveform for virtual user k is modeled as

$$ \tilde{s}_{km}[t] = \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp}] \qquad (2) $$



where a_kp are the complex multipath amplitudes. The amplitude ratios β_k are incorporated
into the amplitudes a_kp. Notice that if k and l are virtual users corresponding to the
DPCCH and the DPDCH of the same physical user then, aside from scaling by β_k
and β_l, a_kp and a_lp are equal. This is due to the fact that the signal waveforms for both the
DPCCH and the DPDCH pass through the same channel.

The waveform s_km[t] is referred to as the signature waveform for the kth virtual user over
the mth symbol period. This waveform is generated by passing the spreading code
sequence c_km[n] through a pulse-shaping filter g[t]

$$ s_{km}[t] = \sum_{r=0}^{N_k - 1} g[t - rN_c]\, c_{km}[r], \qquad c_{km}[r] \equiv c_k[r + mN_k] \qquad (3) $$



where g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine pulse as 
opposed to a root-raised-cosine pulse, the received signal r[t] represents the baseband 
signal after filtering by the matched chip filter. 
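The pulse-shaping step can be illustrated with a short sketch. It assumes the chips for one symbol period are given as a list of complex values and the pulse as a list of real samples, and it shifts the time origin so that the output starts at index 0; the function name is hypothetical and the pulse used in the example is a toy, not a raised cosine.

```python
def signature_waveform(chips, pulse, n_c):
    """Return s[t] = sum_r pulse[t - r*n_c] * chips[r] for one symbol period.

    chips  complex chip sequence for the symbol period
    pulse  sampled pulse shape g[t], indexed from 0 (time origin shifted)
    n_c    samples per chip
    """
    length = len(pulse) + (len(chips) - 1) * n_c
    s = [0j] * length
    for r, c in enumerate(chips):
        for i, g in enumerate(pulse):
            s[r * n_c + i] += g * c      # each chip launches one scaled pulse
    return s


# Two chips, a 3-sample toy pulse, 2 samples per chip.
print(signature_waveform([1 + 1j, -1 + 1j], [1.0, 2.0, 1.0], 2))
```

The output length is the pulse length plus (number of chips - 1) times the samples per chip, so overlapping pulses from adjacent chips add where they coincide.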



3. Matched filter 

The received signal above, which has been match-filtered to the chip pulse, must next be
match-filtered by the user code-sequence filter. The resulting detection statistic is
denoted y_k[m], the matched-filter output for the kth virtual user over the mth
symbol period. Since there are K_v codes, there are K_v such detection statistics, which are
collected into a column vector y[m]. The matched-filter output y_l[m] for the lth virtual user
can be written

$$ y_l[m] \equiv \mathrm{Re}\Big\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \Big\} \qquad (4) $$

where \hat{a}_{lq} is the estimate of a_lq, \hat{\tau}_{lq} is the estimate of τ_lq, and η_l[m] is the match-filtered
receiver noise. Substituting r[t] from Equation (1) above gives



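$$ y_l[m] = \mathrm{Re}\Big\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \Big[ \sum_{k=1}^{2K} \sum_{m'} \sum_{p=1}^{L} a_{kp}\, s_{km'}[nN_c + \hat{\tau}_{lq} + mT_l - \tau_{kp} - m'T_k]\, b_k[m'] \Big] c_{lm}^{*}[n] \Big\} + \eta_l[m] $$

$$ = \mathrm{Re}\Big\{ \sum_{m'} \sum_{k=1}^{2K} \Big[ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*}\, a_{kp}\, C_{lkqp}[m, m'] \Big] b_k[m'] \Big\} + \eta_l[m] $$

$$ C_{lkqp}[m, m'] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l-1} s_{km'}[nN_c + \tau_{lkqp}[m, m']] \cdot c_{lm}^{*}[n], \qquad \tau_{lkqp}[m, m'] \equiv \hat{\tau}_{lq} - \tau_{kp} + mT_l - m'T_k $$

$$ \eta_l[m] \equiv \mathrm{Re}\Big\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} w[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \Big\} \qquad (5) $$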



In order to subtract interference we must, at a minimum, calculate C_lkqp[m,m'] for all virtual
users and for all multipath components. A lower bound on the computational complexity
can be determined by considering the above calculations for synchronous users. For
synchronous users, all at the highest spreading factor, the required number of operations
to calculate C_lkqp[m,m'] is 8(256)(2KL)^2 = 1.31 Gops for K = 100 and L = 4. For real-time
operation 15000 such computations must be performed every second. This amounts to
19.7 TOPS (i.e. Tera Operations Per Second).
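The lower bound can be reproduced with the following sketch, which uses only the parameters quoted above (variable names are illustrative):

```python
# Lower bound on long-code symbol-rate MUD, from the figures quoted above.
K, L, N = 100, 4, 256                 # physical users, multipaths, spreading factor
ops_per_cmac = 8
symbols_per_second = 15_000

ops_per_symbol = ops_per_cmac * N * (2 * K * L) ** 2
tops = ops_per_symbol * symbols_per_second / 1e12

print(f"{ops_per_symbol / 1e9:.2f} Gops per symbol period, {tops:.1f} TOPS")
```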



4. Regenerative MUD 
Because of the extreme computational complexity of symbol-rate MUD for long codes it is
advantageous to resort to regenerative MUD when long codes are used. Although 
regenerative MUD operates at the chip rate, the overall complexity is lower for long 
codes. For regenerative MUD the signal waveforms of interferers are regenerated at the 
sample rate and effectively subtracted from the received signal. A second pass through 
the matched filter then yields improved performance. It turns out that the computational 
complexity of regenerative MUD is linear in the number of users. 

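The received signal can be written

$$ r[t] = \sum_{k=1}^{2K} \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp} - mT_k]\, b_k[m] + w[t] = \sum_{k=1}^{2K} r_k[t] + w[t] \qquad (6) $$

$$ r_k[t] \equiv \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp} - mT_k]\, b_k[m] $$

Subtracting interference gives a cleaned-up signal x_l[t]

$$ x_l[t] \equiv r[t] - \sum_{k=1,\, k \ne l}^{2K} \hat{r}_k[t] = r[t] - \hat{r}[t] + \hat{r}_l[t] \qquad (7) $$

$$ \hat{r}[t] \equiv \sum_{k=1}^{2K} \hat{r}_k[t], \qquad \hat{r}_k[t] \equiv \sum_{m} \sum_{p=1}^{L} \hat{a}_{kp}\, s_{km}[t - \hat{\tau}_{kp} - mT_k]\, \hat{b}_k[m] $$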

Two methods are presented below for performing regenerative MUD. 



First Method 

In order to subtract interference we must reconstruct (regenerate) the waveform s_km[t] as
given in Equation (3). The waveform can be reconstructed using

$$ s_{km}[t] = \sum_{r=0}^{N_k-1} g[t - rN_c]\, c_{km}[r] $$
$$ = \sum_{p=0}^{N_k/4-1} \sum_{j=0}^{3} g[t - (4p + j)N_c]\, c_{km}[4p + j] $$
$$ = \sum_{p=0}^{N_k/4-1} s_{kmp}[t - 4pN_c] \qquad (8) $$
$$ s_{kmp}[t] \equiv \sum_{j=0}^{3} g[t - jN_c]\, c_{kmp}[j], \qquad c_{kmp}[j] \equiv c_{km}[4p + j] $$

The idea is that s_km[t] can be represented as a summation of shifted waveforms s_kmp[t],
which are entirely specified by the 8 binary numbers comprising the complex sequence
c_kmp[j] of length 4. Hence there are only 2^8 = 256 such waveforms. For what follows we
assume that the signals are sampled at N_c = 8 samples per chip. Each is of length 96 +
3(4) = 108 samples assuming that g[t] is of length 96. For 2 bytes per sample (real and
imaginary parts) the total memory requirement is 216*256 = 55296 bytes, which spills out
of L1 cache, but fits entirely in L2 cache.

To generate \hat{r}_k[t] for a single symbol period, 64 of these waveforms must be read from
memory. For each of these 64 waveforms L complex macs are required per sample per
symbol period. Hence 64(8L)(108) operations are required per symbol period. For L = 4
this amounts to 64(32)(108) = 221184 operations per symbol period (1/15 ms), or 3.32
GOPS. The formation of \hat{r}[t] then requires 2K times this, or 3.32(200) = 664 GOPS for K
= 100 physical users. To form r_res[t] + \hat{r}_l[t], where r_res[t] = r[t] - \hat{r}[t] is the residual
signal, requires an additional 2(96+255*4) = 2232
operations per symbol period per virtual user, or another 6.7 GOPS. Finally, the matched
filter operation needs to be performed for each user, which from Equation (4) requires
NLK complex macs (N = 256), or 256(4)(100)(8)*15000 = 12.3 GOPS. The GOPS figures
above are for a single antenna. For two antennas the operations are doubled. Hence the
total computational complexity is 2(664 + 6.7 + 12.3) = 1.37 TOPS. This is for a single-stage
MPIC algorithm. For two stages the computation is doubled.
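The first-method budget can be tallied as follows. This is a sketch that takes every factor from the paragraph above at face value; the variable names are illustrative and the small difference from the quoted 664 GOPS comes from rounding 3.32 GOPS before multiplying.

```python
# First-method operation budget, from the figures quoted above.
K, L = 100, 4
symbols_per_second = 15_000

ops_per_user_symbol = 64 * (8 * L) * 108          # 64 waveforms, 8L ops/sample, 108 samples
regen_gops = ops_per_user_symbol * symbols_per_second * (2 * K) / 1e9
combine_gops = 2 * (96 + 255 * 4) * (2 * K) * symbols_per_second / 1e9
rake_gops = 256 * L * K * 8 * symbols_per_second / 1e9

total_tops = 2 * (regen_gops + combine_gops + rake_gops) / 1e3   # two antennas

print(ops_per_user_symbol)
print(f"{regen_gops:.1f} + {combine_gops:.1f} + {rake_gops:.1f} GOPS per antenna")
print(f"{total_tops:.2f} TOPS for two antennas")
```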

To perform regenerative MUD the baseband antenna stream data must be brought onto 
the MUD board. The required bandwidth is 

[2 Bytes(complex)/Sa/Ant][2 Ant][8 Sa/chip][3.84 Mchips/ second] = 123 MB/s 
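The bandwidth product above evaluates as in this short sketch (names are illustrative):

```python
# Antenna-stream bandwidth onto the MUD board.
bytes_per_complex_sample = 2
antennas = 2
samples_per_chip = 8
chips_per_second = 3.84e6

bandwidth_mb_s = bytes_per_complex_sample * antennas * samples_per_chip * chips_per_second / 1e6
print(f"{bandwidth_mb_s:.2f} MB/s")
```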



Second Method 

The second method is to represent the waveform for each multipath for each user as a
complex impulse train with N_c = 8 samples per impulse. The complex amplitude of each
impulse is the product of the complex chip, complex multipath amplitude and the binary
(real) data bit estimate. These 2KL complex streams (times 2 for 2 antennas) are added to
form a composite signal. Since this composite signal is a sum of many impulse trains, all
asynchronous, the composite signal is a dense (i.e. no systematic zeros) signal at the
sample rate. A block diagram of the processing is shown in Figure 1.

Figure 1. A block diagram of the long-code MUD processing
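From Equations (7) and (8)

$$ \hat{r}[t] = \sum_{k=1}^{2K} \hat{r}_k[t] $$
$$ = \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{m} s_{km}[t - \hat{\tau}_{kp} - mT_k]\, \hat{b}_k[m] $$
$$ = \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{m} \sum_{r=0}^{N_k-1} g[t - \hat{\tau}_{kp} - mT_k - rN_c]\, c_k[r + mN_k]\, \hat{b}_k[m] $$
$$ = \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{n} g[t - \hat{\tau}_{kp} - nN_c]\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] $$
$$ = \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{r} g[r] \sum_{n} \delta[t - r - \hat{\tau}_{kp} - nN_c]\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] $$
$$ = \sum_{r} g[r] \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \delta[t - r - \hat{\tau}_{kp} - nN_c]\, \hat{a}_{kp}\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] $$
$$ = \sum_{r} g[r]\, a[t - r] \qquad (9) $$

where a[t] is the composite signal. For each symbol period this requires 256(10)(2KL)
operations per antenna. For two antennas this amounts to 5120(200)(4) = 4096000
operations per symbol period, or 61.4 GOPS.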
The estimate of the received signal is then determined by passing the composite signal
through the raised-cosine filter g[t] of length 96, which requires 96 real macs, or 192 real
operations, per sample per real stream. There are a total of 4 real streams (2 antennas,
real and imaginary streams). The total GOPS then for N_c = 8 samples per chip is
192(4)(8)(3.84M) = 23.6 GOPS.

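The final step is to pass the cleaned-up signal x_l[t] = r_res[t] + \hat{r}_l[t] through the matched-filter
(i.e. rake receiver), which gives the improved detection statistic

$$ y_l^{(1)}[m] = \mathrm{Re}\Big\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} x_l[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \Big\} $$
$$ = \mathrm{Re}\Big\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \hat{r}_l[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \Big\} + y_l^{res}[m] $$
$$ = A_l^2\, \hat{b}_l[m] + y_l^{res}[m] \qquad (10) $$

where y_l^{res}[m] denotes the matched-filter output for the residual signal r_res[t].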



The matched filter operation requires NLK complex macs, or 256(4)(100)(8)*15000 = 12.3
GOPS. The GOPS figures above are for a single antenna. For two antennas the
operations are doubled, giving 24.6 GOPS. The total computational complexity for the
second method is then 61.4 + 23.6 + 24.6 = 109.6 GOPS.
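The three contributions to the second-method total can be recomputed as below. The sketch uses only factors quoted in this section; variable names are illustrative.

```python
# Second-method operation budget (two antennas), from the figures quoted above.
K, L = 100, 4
symbols_per_second = 15_000
samples_per_second = 8 * 3.84e6                    # N_c = 8 samples per chip

impulse_gops = 2 * 256 * 10 * (2 * K * L) * symbols_per_second / 1e9
filter_gops = 192 * 4 * samples_per_second / 1e9   # 4 real streams, 192 ops per sample
rake_gops = 2 * 256 * L * K * 8 * symbols_per_second / 1e9

total_gops = impulse_gops + filter_gops + rake_gops
print(f"{impulse_gops:.1f} + {filter_gops:.1f} + {rake_gops:.1f} = {total_gops:.1f} GOPS")
```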
To: Wireless Communications Group
From: J. H. Gates
Subject: R-matrix GOPS
Date: June 21, 2000

1. Introduction 

This report investigates a number of different methods for calculating the R-matrix
elements. There are two parts to the calculation. First is the calculation of
the user code correlations at lag offsets determined by the searcher receivers.
This calculation must be performed every time a multipath component changes
to a new lag. The assumption used here is that every 100 ms one multipath
component changes to a new lag for each user. Hence, if each user has 4
multipath lags, then all R-matrix elements will have changed after 400 ms. The
validity of this assumption will have to be tested with measured data. Note that
the WCDMA standard calls out a test with 2 multipath components, where one lag
changes every 191 ms [1]. The second part is the actual calculation of the R-matrix
elements, which requires a double summation of code correlations over all
multipath components, with each term scaled by the Rayleigh-fading multipath
amplitudes. The maximum time period to perform this calculation is about 1.33
ms. Hence there are two parts to the calculation, each with a different update
rate.

Section 2 is devoted to the first part of the calculation, the code correlations.
Section 3 covers the actual calculation of the R-matrix elements.



2. Calculation of User Code Correlations

The R-matrix elements can be expressed as [2] 



where C_lkqq'[m'] is a five-dimensional matrix of code correlations. Both l and k
range from 1 to K_v, where K_v is the number of virtual users. If there are K physical
users, all operating at the highest spreading factor, then there are K_v = 2K virtual
users. For now consider K = 128 so that K_v = 256. The indices q and q' range
from 1 to L, the number of multipath components, which for this report is
assumed to be equal to 4. The symbol period offset m' ranges from -1 to 1. The
total number of matrix elements to be calculated is then
3(K_v L)^2 = 3(1024)^2 = 3M complex elements, or 24 MB if each element is a
float. This number is reduced, however, due to the symmetries



2N, „ , 



k p n 



= TTirS E - P^^c +"i'T+ % -f^. ]c: [p\ c, [n] 



(2) 



k 



so that it is sufficient to store elements for offsets m' = 0, 1. The memory
requirement is then 16 MB if each element is a float. If the elements are stored
as bytes the requirement is reduced to 4 MB.
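The storage figures can be checked with the sketch below. It assumes that a "float" element is a complex value stored as two 4-byte floats and that a "byte" element is a complex value stored as two 1-byte integers, with 1 MB taken as 2^20 bytes; those assumptions are not stated in the memo but reproduce its numbers.

```python
# Storage for the code-correlation matrix C, for K_v = 256 and L = 4.
K_v, L = 256, 4
MB = 2**20

elements_per_offset = (K_v * L) ** 2          # complex elements for one m'
all_offsets_float_mb = 3 * elements_per_offset * 8 / MB   # m' = -1, 0, 1
two_offsets_float_mb = 2 * elements_per_offset * 8 / MB   # m' = 0, 1 after symmetry
two_offsets_byte_mb = 2 * elements_per_offset * 2 / MB

print(all_offsets_float_mb, two_offsets_float_mb, two_offsets_byte_mb)
```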



Referring to Equation 1, line 2, it is evident that each element of C_lkqq'[m'] is a
complex dot product between a code vector c_l and a waveform vector s_kqq'. The
length of the code vector is 256. The length of the waveform vector is L_g + 255N_c,
where L_g is the length of the raised-cosine pulse vector g[t] and N_c is the number
of samples per chip. The values for these parameters as currently implemented
are L_g = 48 and N_c = 4. The length of the waveform vector is then 1068, but for
the dot product it is accessed at a stride of N_c = 4, which gives effectively a
length of 267. Note that the code and waveform vectors in general do not entirely
overlap. Also note that an increment or decrement in the symbol offset index m'
slides the waveform vector 256 elements to the left or right respectively. Figure 1
shows that the total number of complex macs (cmacs) for all three (m' = -1, 0, 1)
dot products is 267, irrespective of any relative offset.




























Figure 1. Overlap of waveform and code vectors. The total
number of complex macs (cmacs) for all three (m' = -1, 0, 1)
dot products is 267, irrespective of any relative offset.



Hence for any given combination of indices lkqq' the three elements C_lkqq'[m'],
corresponding to m' = -1, 0 and 1, require 267 cmacs in total. Since
there are (K_v L)^2 combinations of indices, the calculation of all elements C_lkqq'[m']
requires (K_v L)^2 (267) cmacs. Given the symmetry condition, only half of the
elements need to be calculated, and noting that each cmac requires 8 operations
to perform, the total number of operations required is

$$ N_{ops} = \tfrac{1}{2}(K_v L)^2 (267)(8) = \tfrac{1}{2}(1024)^2 (267)(8) = 1.12\ \text{Gops} \qquad (3) $$

The total number of GOPS (Giga Operations Per Second), then, given the 400
ms update rate is

$$ \text{GOPS} = \frac{\tfrac{1}{2}(K_v L)^2 (267)(8)}{400\ \text{ms}} = \frac{\tfrac{1}{2}(1024)^2 (267)(8)}{400\ \text{ms}} = 2.8\ \text{GOPS} \qquad (4) $$

The next section addresses the calculation of the R-matrix elements. 
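The code-correlation load of this section can be reproduced with the short sketch below, using the parameters quoted in the text (K_v = 256, L = 4, L_g = 48, N_c = 4, 8 operations per cmac, 400 ms update period). Variable names are illustrative.

```python
# Code-correlation (C-matrix) load for the R-matrix memo.
K_v, L = 256, 4
L_g, N_c, N = 48, 4, 256

waveform_length = L_g + (N - 1) * N_c            # samples in the waveform vector
cmacs_per_index_set = waveform_length // N_c     # stride-N_c access, covers m' = -1, 0, 1
ops_per_cmac = 8

# Half of the (K_v L)^2 index combinations, by symmetry.
total_ops = (K_v * L) ** 2 * cmacs_per_index_set * ops_per_cmac // 2
gops = total_ops / 0.400 / 1e9

print(waveform_length, cmacs_per_index_set)
print(f"{total_ops / 1e9:.2f} Gops per full update, {gops:.1f} GOPS")
```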



3. Calculation of R-matrix Elements 

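Consider the calculation of the R-matrix elements

$$ \rho_{lk}[m']\, A_l A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ \hat{a}_{lq}^{*}\, \hat{a}_{kq'} \cdot C_{lkqq'}[m'] \} \qquad (5) $$

The total number of matrix elements to be calculated is 3K_v^2. This number
is reduced, however, due to the symmetries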



(6) 



so that the total number of matrix elements to be calculated is (3/2)K_v^2.

Now let us consider the operations per element. Dropping explicit reference to
the symbol period offset [m'], the matrix elements are

$$ \rho_{lk}\, A_l A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ \hat{a}_{lq}^{*}\, \hat{a}_{kq'} \cdot C_{lkqq'} \} \qquad (7) $$

A brute-force calculation requires L^2(6 + 3 + 1) operations (1 complex multiply,
one half-complex multiply, i.e. the real part, and one real add, or 6 real
multiplies and 4 real adds). The total number of operations is then

$$ N_{ops} = \tfrac{3}{2}(K_v L)^2 (10) \qquad (8) $$

For a vehicular speed of 120 km/h the Doppler frequency is 216.67 Hz for a user
at frequency 1950 MHz. The Doppler spread is thus 433.33 Hz, and the
corresponding coherence time is about 2.3 ms. Hence the multipath amplitudes
are changing with a time constant of about 2 ms, and consequently the second
part of the calculation must be updated at least every 2 ms. The channel
amplitudes are calculated on a time slot by time slot basis. Each time slot is
10/15 = 2/3 = 0.67 ms. Hence 2 ms equals 3 time slots, whereas two slots equal
1.33 ms. Figures 2 and 3 below show the MUD efficiency versus user velocity for
2 ms and 1.33 ms update times respectively. The plots show that to be able to
effectively handle high-velocity users the update time should be 1.33 ms. When
users are at various speeds the interference from low-speed users is cancelled
more effectively than the interference from high-speed users. The MUD efficiency
will then be an average of the MUD efficiency corresponding to each user's
speed.
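The Doppler and slot-timing numbers in the paragraph above can be verified with a few lines. The sketch assumes a propagation speed of 3e8 m/s and takes the coherence time as the reciprocal of twice the Doppler frequency, which is the relation that reproduces the quoted 2.3 ms.

```python
# Doppler frequency, coherence time and slot timing for a 120 km/h user.
speed_m_s = 120 / 3.6                 # 120 km/h in m/s
carrier_hz = 1950e6
c_m_s = 3e8                           # assumed propagation speed

doppler_hz = speed_m_s * carrier_hz / c_m_s
coherence_time_ms = 1e3 / (2 * doppler_hz)

slot_ms = 10 / 15                     # 15 slots per 10 ms frame
print(f"Doppler {doppler_hz:.2f} Hz, coherence time {coherence_time_ms:.1f} ms")
print(f"2 slots = {2 * slot_ms:.2f} ms, 3 slots = {3 * slot_ms:.2f} ms")
```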



Figure 2. MUD efficiency versus user velocity for a 2 ms R-matrix update time
(calculations updated every 3 time slots).

Figure 3. MUD efficiency versus user velocity for a 1.33 ms R-matrix update time
(calculations updated every 2 time slots).



The calculations below are based on a 1.33 ms update time. Note that most of
the capacity and coverage benefits calculated for MUD so far have assumed
70% MUD efficiency. The 1.33 ms update time is sufficient to achieve 70% MUD
efficiency. The total GOPS are then

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (10)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (10)}{1.33\ \text{ms}} = 11.8\ \text{GOPS} \qquad (9) $$



where we have assumed L = 4 multipath components. A better way to perform
this operation is

$$ \rho_{lk}\, A_l A_k = \sum_{q=1}^{L} \mathrm{Re}\Big\{ \hat{a}_{lq}^{*} \Big[ \sum_{q'=1}^{L} C_{lkqq'}\, \hat{a}_{kq'} \Big] \Big\} \qquad (10) $$

The inner sum is a matrix-vector multiply, hence requiring L^2 cmacs, and the
outer sum is the real part of a complex dot product, which requires L half-cmacs.
The total is then (L^2 + L/2) = 1.125L^2 cmacs (for L = 4) times 8 operations per
cmac, or 9L^2 operations, which gives

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (9)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \qquad (11) $$

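The above calculations are represented in terms of complex numbers, which are
not directly calculable. To express the above equations explicitly in terms of real
numbers it is convenient to cast the calculations into matrix form

$$ \rho_{lk} A_l A_k = \sum_{q=1}^{L}\sum_{q'=1}^{L} \mathrm{Re}\{ \hat{a}_{lq}^{*} \hat{a}_{kq'} C_{lkqq'} \} = \mathrm{Re}\left\{ \begin{bmatrix} \hat{a}_{l1}^{*} & \cdots & \hat{a}_{lL}^{*} \end{bmatrix} \begin{bmatrix} C_{lk11} & \cdots & C_{lk1L} \\ \vdots & & \vdots \\ C_{lkL1} & \cdots & C_{lkLL} \end{bmatrix} \begin{bmatrix} \hat{a}_{k1} \\ \vdots \\ \hat{a}_{kL} \end{bmatrix} \right\} = \mathrm{Re}\{ a_l^H C_{lk}\, a_k \} \qquad (12) $$

The quadratic form a_l^H C_lk a_k can be expressed

$$ \mathrm{Re}\{ a_l^H C_{lk}\, a_k \} = \mathrm{Re}\{ (a_R + j a_I)^H (C_R + j C_I)(b_R + j b_I) \} $$
$$ = a_R^T C_R b_R - a_R^T C_I b_I + a_I^T C_I b_R + a_I^T C_R b_I $$
$$ = \begin{bmatrix} a_R \\ a_I \end{bmatrix}^T \begin{bmatrix} C_R & -C_I \\ C_I & C_R \end{bmatrix} \begin{bmatrix} b_R \\ b_I \end{bmatrix} \equiv \tilde{a}^T \tilde{C}\, \tilde{b} \qquad (13) $$

where a_l = a_R + j a_I, a_k = b_R + j b_I and C_lk = C_R + j C_I.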



The matrix-vector multiplication requires (2L)^2 macs. The dot product adds (2L)
macs so that the total is (2L)^2 + (2L) macs. For L = 4 we have 1.125(2L)^2 macs =
4.5L^2 macs = 9L^2 operations. The total GOPS are then

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (9)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \qquad (14) $$



Now consider a different formulation which attempts to reuse the amplitude-amplitude
multiplications. Consider the calculation of \tilde{a}^T \tilde{C} \tilde{b}

$$ \tilde{a}^T \tilde{C}\, \tilde{b} = \mathrm{tr}[\tilde{a}^T \tilde{C}\, \tilde{b}] = \mathrm{tr}[\tilde{C}\, (\tilde{b}\, \tilde{a}^T)] = \mathrm{tr}[\tilde{C}\, \tilde{X}], \qquad \tilde{X} \equiv \tilde{b}\, \tilde{a}^T \qquad (15) $$

The calculations to produce matrix X are pure multiplications, but the elements,
once calculated, can be reused for the other virtual users corresponding to the
same physical users. For voice-only users there are 2 virtual users per physical
user. For data users there can be up to 65 virtual users per physical user. For
now, however, we stay with our 128 voice-user scenario. To calculate X, then,
requires (2L)^2 = 4L^2 multiplications. This calculation is performed once per pair of
physical users, so the total number of operations is

$$ N_{ops} = (KL)^2 (4) = (K_v L)^2 (4/4) = \tfrac{3}{2}(K_v L)^2 (2/3) \qquad (16) $$
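Effectively, then, X requires (2/3)L^2 operations. The details to calculate \tilde{a}^T \tilde{C} \tilde{b}
are

$$ \tilde{a}^T \tilde{C}\, \tilde{b} = \mathrm{tr}[\tilde{C}\, \tilde{X}] = \mathrm{tr}\left\{ \begin{bmatrix} \tilde{c}_1^T \\ \vdots \\ \tilde{c}_{2L}^T \end{bmatrix} \begin{bmatrix} \tilde{x}_1 & \cdots & \tilde{x}_{2L} \end{bmatrix} \right\} = \sum_{i=1}^{2L} \tilde{c}_i^T \tilde{x}_i \qquad (17) $$

where \tilde{c}_i^T is the ith row of \tilde{C} and \tilde{x}_i is the ith column of \tilde{X}. Hence we have 2L dot
products of length 2L, which require (2L)^2 macs = 8L^2 operations. To calculate
\tilde{a}^T \tilde{C} \tilde{b} then requires 8L^2 + (2/3)L^2 = 8.67L^2 operations, which gives

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (8.67)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (8.67)}{1.33\ \text{ms}} \approx 10.2\ \text{GOPS} $$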

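A better way to perform this calculation is as follows

$$ \rho = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ a_q^{*}\, a_{q'}\, C_{qq'} \} = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ X_{qq'}\, C_{qq'} \} \qquad (18) $$

$$ = \sum_{q=1}^{L} \sum_{q'=1}^{L} \big( X_{qq'}^{R} C_{qq'}^{R} - X_{qq'}^{I} C_{qq'}^{I} \big) \qquad (19) $$

$$ X_{qq'} \equiv a_q^{*}\, a_{q'} $$

where for convenience we have dropped A_l A_k, the lk subscripts and the hat
symbols. The calculation of X requires

$$ N_{ops} = (KL)^2 (6) = (K_v L)^2 (6/4) = \tfrac{3}{2}(K_v L)^2 (1) \qquad (20) $$

operations. Note that, once the X values are calculated, the remainder of the
calculation is a long dot product of length 2L^2, hence requiring 2L^2 macs, or 4L^2
operations, which gives

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (5)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (5)}{1.33\ \text{ms}} = 5.9\ \text{GOPS} \qquad (21) $$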



Dual Diversity Antennas 

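When dual diversity antennas are employed, the calculation of the R-matrix
elements becomes

$$ \rho = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ a_{1q}^{*}\, a_{1q'}\, C_{qq'} \} + \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ a_{2q}^{*}\, a_{2q'}\, C_{qq'} \} $$
$$ = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ (X_{1qq'} + X_{2qq'})\, C_{qq'} \} $$
$$ = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\{ X_{qq'}\, C_{qq'} \} \qquad (22) $$
$$ X_{qq'} \equiv X_{1qq'} + X_{2qq'} $$

To calculate X for dual diversity antennas, then, requires

$$ N_{ops} = (KL)^2 (14) = (K_v L)^2 (14/4) = \tfrac{3}{2}(K_v L)^2 (7/3) = \tfrac{3}{2}(K_v L)^2 (2.33) \qquad (23) $$

operations. The remainder of the calculation is again a long dot product of length
2L^2 requiring 4L^2 operations, which gives

$$ \text{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2 (6.33)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2 (6.33)}{1.33\ \text{ms}} \approx 7.5\ \text{GOPS} $$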
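The alternative formulations of this section differ only in the number of operations charged per (3/2)(K_v L)^2 element, so they can be compared side by side. The sketch below is illustrative: the labels are informal, the update period is taken as exactly two time slots, and the per-element counts are those derived in the text (10, 9, 8.67, 4 + 1 and 4 + 2.33).

```python
# R-matrix load for the formulations discussed in this section.
K_v, L = 256, 4
update_period_s = 2 * (10e-3 / 15)          # two WCDMA time slots, about 1.33 ms


def gops(ops_per_element):
    """GOPS for (3/2)(K_v L)^2 elements at the given operations per element."""
    return 1.5 * (K_v * L) ** 2 * ops_per_element / update_period_s / 1e9


variants = {
    "brute force": 10,
    "matrix-vector form": 9,
    "trace form with X reuse": 8 + 2 / 3,
    "long dot product with X reuse": 4 + 1,
    "long dot product, dual antennas": 4 + 7 / 3,
}

for name, ops in variants.items():
    print(f"{name:34s} {gops(ops):5.1f} GOPS")
```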



Reuse of C data 

So far we have not addressed the problem associated with a lack of data reuse,
which renders our calculations I/O limited. The C data can be reused by
introducing extra latency into the calculations. For a given user, a single
multipath component changes on average once every 100 ms, or once every 150
slots. Suppose we collect and save in cache 4 amplitude estimate vectors a_n[q],
where q is the 2 ms update index. The total latency is then 8 ms = 12 time slots.
During this time the probability that a multipath lag changes is (8 ms)/(100 ms) =
0.08. The probability that the matrix C_lk changes is then 1 - (1 - 0.08)^2 = 0.15.
Hence for most matrices C_lk we will be able to calculate



for 12 time slots q for only one read of C_lk from memory. The penalty for this
reuse is the 8 ms of latency incurred.
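The probability quoted above can be reproduced as follows. The sketch assumes that a matrix C_lk changes when a lag of either of its two users changes and that the two events are independent, which is the reading that gives the quoted 0.15.

```python
# Probability that a cached C matrix becomes stale during the added latency.
latency_ms = 8
mean_lag_change_interval_ms = 100

p_lag_change = latency_ms / mean_lag_change_interval_ms
p_matrix_change = 1 - (1 - p_lag_change) ** 2     # either of the two users changes

print(p_lag_change, round(p_matrix_change, 2))
```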
To: John Gates, John Greene, Alden Fuchs, Frank Lauginiger
From: Mike Vinskus
Subject: Theoretically optimum load balancing for the R matrix calculations
Date: 31-AUG-2000
File Ref: mjv-9.doc



This memo describes the calculation of optimum R matrix partitioning points in
normalized virtual user space. These partitioning points provide an equal, and hence
balanced, computation load per processor. The computational model of the R matrix
calculations does not include any data access overhead or caching effects. It is shown
that a closed form recursive solution exists that can be solved for an arbitrary number of
processors.

Although three R matrices are output from the R matrix calculation function, only half of
the elements are explicitly calculated. This is due to the symmetry condition that exists
between the R matrices.

In essence, only two matrices need to be calculated. The first one is a combination of
R(1) and R(-1). The second is the R(0) matrix. In this case, the essential R(0) matrix
elements have a triangular structure to them. The number of computations performed to
generate the raw data for the R(1)/R(-1) and R(0) matrices are combined and optimized as
a single number. This is due to the reuse of the X matrix outer product values across the
two R matrices. Since the bulk of the computations involve combining the X matrix and
correlation values, they dominate the processor utilization. These computations are used
as a cost metric in determining the optimum loading of each processor.



The optimization problem is formulated as an equal area problem, where the solution
makes all partition areas equal. Since the major dimensions of the R matrices
are in terms of the number of active virtual users, the solution space for this problem is in
terms of the number of virtual users per processor. By normalizing the solution space by
the number of virtual users, the solution is applicable for an arbitrary number of virtual
users.

Figure 1: Normalized R matrix computation model.

Figure 1 shows the model of the normalized optimization problem. The computations for
the R(1)/R(-1) matrix are represented by the square HJKM, while the computations for
the R(0) matrix are represented by the triangle ABC. From geometry, the area of a
rectangle of length b and height h is

$$A = bh.$$

For a triangle with a base width b and height h, the area is calculated by

$$A = \tfrac{1}{2} bh.$$

When combined with a common height a_i, the formula for the area becomes

$$A_i = (1)(a_i) + \tfrac{1}{2}(a_i)(a_i) = \tfrac{1}{2} a_i^2 + a_i.$$

The formula for A_i gives the area for the total region below the partition line. For
example, the formula for A_2 gives the area within the rectangle HQRM plus the region
within triangle AFG. For the cost function, the difference in successive areas is used.
That is

$$B_i = A_i - A_{i-1} = \tfrac{1}{2} a_i^2 + a_i - \tfrac{1}{2} a_{i-1}^2 - a_{i-1}.$$



For an optimum solution, the B_i must be equal for i = 1, 2, ..., N, where N is the number
of processors performing the calculations. Because the total normalized load is equal to
A_N, the load per processor is equal to A_N/N:

$$B_i = \frac{A_N}{N} = \frac{1}{N}\cdot\frac{3}{2} = \frac{3}{2N}.$$

By combining the two equations for B_i, the solution for a_i is found by finding the roots of
the equation:

$$\tfrac{1}{2} a_i^2 + a_i - \tfrac{1}{2} a_{i-1}^2 - a_{i-1} - \frac{3}{2N} = 0.$$

The solution for a_i is:

$$a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \frac{3}{N}}, \quad \text{for } i = 1, 2, \ldots, N.$$

Since the solution space must fall in the range [0, 1], negative roots are not valid
solutions to the problem. On the surface, it appears that the a_i must be solved by first
solving for the case where i = 1. However, by expanding the recursions of the a_i and using
the fact that a_0 equals zero, a solution that does not require the previous a_j, j = 0, 1, ..., i-1,
exists. The solution is:

$$a_i = -1 + \sqrt{1 + \frac{3i}{N}}.$$

Table 1 shows the normalized partition values for two, three, and four processors. To
calculate the actual partitioning values, the number of active virtual users is multiplied by
the corresponding table entries. Since a fraction of a user cannot be allocated, a ceiling
operation is performed that biases the number of virtual users per processor towards the
processors whose loading function is less sensitive to perturbations in the number of
users.

Table 1: Normalized partition locations for two, three, and four processors.

         Two processors          Three processors      Four processors
a_1      -1 + √(5/2) (0.5811)    -1 + √2 (0.4142)      -1 + √(7/4)  (0.3229)
a_2                              -1 + √3 (0.7321)      -1 + √(5/2)  (0.5811)
a_3                                                    -1 + √(13/4) (0.8028)
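The closed-form partition points and the ceiling step described above can be sketched in Python as follows. The function names are illustrative and not taken from the memo; the sketch assumes the normalized cost model A(a) = a^2/2 + a derived earlier.

```python
import math


def partition_points(num_processors):
    """Normalized partition locations a_i = -1 + sqrt(1 + 3i/N), i = 1..N."""
    n = num_processors
    return [-1.0 + math.sqrt(1.0 + 3.0 * i / n) for i in range(1, n + 1)]


def area_below(a):
    """Normalized computation load below partition line a: a^2/2 + a."""
    return 0.5 * a * a + a


def partition_users(num_virtual_users, num_processors):
    """Upper user index of each partition after the ceiling operation."""
    return [math.ceil(a * num_virtual_users)
            for a in partition_points(num_processors)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, [round(a, 4) for a in partition_points(n)])
    print(partition_users(200, 4))
```

Each successive difference of `area_below` equals 3/(2N), which is the equal-load condition the memo sets out to satisfy.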
To: Jonathan Schonfeld
From: Nmf
Date: 23-FEB-2001
File Ref: mjv-018-degraded_mode_desc.doc
Subject: Degraded mode of operation for the MUD algorithm



Reference [1] showed that the load balancing for the R matrix calculations resulted in a non-uniform
partitioning of the rows of the final R matrices over a number of processors. In summary, the
partition sizes increase as the partition starting user index increases.

When the system is running at full capacity (i.e. the maximum number of users is processed while
still within the bounds of real-time operation) and a computational node has a failure, the impact can
be significant.

This impact can be minimized by allocating the first user partition to the disabled node. Also the
values that would have been calculated by that node are set to zero. This reduces the effects of the
failed node. Also, by changing which user data is set to zero (i.e. which users are assigned to the
failed node) the overall errors due to the lack of non-zero output data for that node are averaged
over all of the users, providing a "soft" degradation.

References:

[1] M. Vinskus, "mjv-009: Theoretically optimum load balancing for the R matrix calculations," 31-AUG-2000.
[2] M. Vinskus, "mjv-010: Preliminary degraded MUD operation results," 19-OCT-2000.
[3] J. Oates, "jho-001: MUD Algorithms," 25-APR-2000.
To: Wireless Communications Group
From: J. H. Oates
Date: November 13, 2000
Subject: Methods for Calculating the C-matrix Elements

1. Direct Method

The direct method for calculating the C-matrix elements is

y 



L L 

(1) 



p 1 



Symmetry 



Due to symmetry there are 1.5(K_v L)^2 elements to calculate. Assuming all users
are at SF 256, each calculation requires 256 cmacs, or 2048 operations. The
probability that a multipath changes in a 10 ms time period is approximately
10/200 = 0.05 if all users are at 120 kmph. Assuming a mix of user velocities,
let's say the probability is 0.025. Since the C-matrix elements represent the
interaction between two users, the probability that C-matrix elements change in a
10 ms time period is approximately 0.10 if all users are at 120 kmph, or 0.05 for
a mix of user velocities. The GOPS are tabulated in Table 1 below.
The C-matrix elements also need to be updated when the spreading factor
changes. The spreading factor can change due to

• AMR codec rate changes 

• Multiplexing of DCCH 

• Multiplexing data services 

For lack of a better number, assume that 5% of the users, hence 10% of the 
elements change rate every 10 ms. 



Table 1. GOPS to update C-matrix elements using the direct method.

K_v    High velocity users    1.5(K_v L)^2    Gops     Percentage change    GOPS
200    100%                   960,000         1.966    20                   39.3
200    50%                    960,000         1.966    15                   29.5
128    100%                   393,216         0.805    20                   16.1
128    50%                    393,216         0.805    15                   12.1
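The entries of Table 1 follow directly from the element count, the 2048 operations per element, and the 10 ms update period. A minimal Python sketch of that arithmetic is given here; the function name and argument names are illustrative, not from the memo.

```python
def direct_method_load(kv, percent_change, num_paths=4,
                       ops_per_element=2048, period_s=0.010):
    """Return (elements, giga-ops for a full update, sustained GOPS)."""
    elements = 1.5 * (kv * num_paths) ** 2           # 1.5 (K_v L)^2
    gops_full = elements * ops_per_element / 1e9     # recompute every element once
    gops_rate = gops_full * (percent_change / 100.0) / period_s
    return elements, gops_full, gops_rate


if __name__ == "__main__":
    for kv, pct in ((200, 20), (200, 15), (128, 20), (128, 15)):
        elements, full, rate = direct_method_load(kv, pct)
        print(kv, pct, int(elements), round(full, 3), round(rate, 1))
```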



2. FFT Method

The FFT can be used to calculate the correlations for a range of offsets τ using

ZiV J „ 

= Q[Ttt,,.[m']] 

(3) 

2iV J „ 

The length of the waveform s_k[t] is L_g + 255 N_c = 1068 for L_g = 48 and N_c = 4. This
is represented as N_c waveforms of length L_g/N_c + 255 = 267.

One advantage of this approach is that elements can be stored for a range of
offsets τ so that calculations do not need to be performed when lags change. For
delay spreads of about 4 μs, 32 samples need to be stored for each m'.
3. Using Code Correlations

The C-matrix elements can be represented in terms of the underlying code
correlations using

$$
\begin{aligned}
C_{lkqq'}[m'] &= \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}]\, c_l^*[n] \\
&= \frac{1}{2N_l}\sum_n \sum_m g[mN_c + \tau]\, c_k[n-m]\, c_l^*[n] \\
&= \sum_m g[mN_c + \tau]\, \frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m] \qquad (4)
\end{aligned}
$$

where τ = m'T + τ_lq - τ_kq', and the inner sum over n is the code correlation
Γ_lk[m] (the Γ-matrix).

If the length of g[t] is L_g = 48 and N_c = 4, then the summation over m requires
48/4 = 12 macs for the real part and 12 macs for the imaginary part. The total ops
is then 48 ops per element. (Compare with 2048 operations for the direct
method.) Hence for the case where there are 200 virtual users and 20% of the
C-matrix needs updating every 10 ms the required complexity is (960000 el)(48
ops/el)(0.20)/(0.010 sec) = 921.6 MOPS. This is the required complexity to
compute the C-matrix from the Γ-matrix. The cost of computing the Γ-matrix must
also be considered. There is reason to hope that the Γ-matrix can be efficiently
computed since the fundamental operation is a convolution of codes with
elements constrained to be +/-1.



The r-matrix elements can be calculated using 

• the FFT 

• Modulo-2 arithmetic 

• Hardware XOR 

• Short-code generator(?) 
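The equivalence behind Equation (4), that a C-matrix element computed directly from the signature waveform equals a short sum of pulse samples times code correlations, can be checked numerically. The sketch below is only an illustration: the pulse `g` is a truncated sinc stand-in rather than the actual raised-cosine filter, the code length is shortened from 256 to 16 chips, the codes are random, and all function names are hypothetical.

```python
import math
import random

NC = 4    # samples per chip (N_c)
LG = 48   # pulse length in samples (L_g)
N = 16    # chips per symbol; the memo uses 256, shortened here for speed


def g(t):
    """Stand-in pulse, non-zero only for -LG/2 + 1 <= t <= LG/2, with g(0) = 1."""
    if t < -LG // 2 + 1 or t > LG // 2:
        return 0.0
    if t == 0:
        return 1.0
    x = math.pi * t / NC
    return math.sin(x) / x


def random_code(seed):
    """Random complex code with real and imaginary parts in {-1, +1}."""
    rng = random.Random(seed)
    return [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(N)]


def chip(code, n):
    """Code chip, zero outside 0..N-1."""
    return code[n] if 0 <= n < N else 0j


def signature(code, t):
    """s_k[t] = sum_p g[t - p*NC] c_k[p]."""
    return sum(g(t - p * NC) * code[p] for p in range(N))


def c_direct(c_l, c_k, tau):
    """Direct method: dot product of the waveform with the conjugated code."""
    return sum(signature(c_k, n * NC + tau) * c_l[n].conjugate()
               for n in range(N)) / (2 * N)


def gamma(c_l, c_k, m):
    """Code correlation: (1/2N) sum_n c_l*[n] c_k[n - m]."""
    return sum(c_l[n].conjugate() * chip(c_k, n - m) for n in range(N)) / (2 * N)


def c_from_gamma(c_l, c_k, tau):
    """Equation (4): sum over the few m for which g[m*NC + tau] is non-zero."""
    m_min = math.ceil((-tau - LG / 2 + 1) / NC)
    m_max = math.floor((-tau + LG / 2) / NC)
    return sum(g(m * NC + tau) * gamma(c_l, c_k, m)
               for m in range(m_min, m_max + 1))


if __name__ == "__main__":
    c_l, c_k = random_code(1), random_code(2)
    for tau in (-7, 0, 5, 13):
        print(tau, abs(c_direct(c_l, c_k, tau) - c_from_gamma(c_l, c_k, tau)))
```

With L_g = 48 and N_c = 4 the second form touches at most 12 correlation values per element, which is where the 48-operation count comes from.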
4. Using Fundamental Correlations

The waveform s_k[t] can be decomposed into fundamental waveforms
corresponding to 4-chip segments of the corresponding complex user codes.
There are 2^8 = 256 such waveforms. Each of these can be correlated with
another 256 possible 4-chip code segments. For each correlation there are about
64 offsets that produce a non-zero correlation. Hence all correlation calculations
can be represented in terms of 256(256)(64) = 4M fundamental complex
correlations. The C-matrix elements are then

13 1 
m Q ^ :;Tr [^^^ + ' < w 



63 63 1 3 

/=0 ;=50 I n=0 

63 63 



63 63 



7=0 J=0 



Using the above, each C-matrix element requires 64(64) = 4096 complex adds, 
or 8192 operations to calculate. (Compare with 2048 operations for the direct 
method.) 

Alternately, the calculations can be represented in terms of 4-chip real code 
segments and the corresponding waveforms. Hence all correlation calculations 
can be represented in terms of 16(16)(64) = 16K fundamental real correlations. 
To: Wireless Communications Group
From: J. H. Oates
Date: August 10, 2000
Subject: Calculation of C-matrix Elements

1. Introduction



The C-matrix elements are used to calculate the R-matrices, which are used by the MDF
interference cancellation routine. Each C-matrix element can be calculated as a dot
product between the kth user's waveform and the lth user's code stream, each offset by
some multipath delay. For this method of calculation, each time a user's multipath profile
changes all C-matrix elements associated with the changed profile must be recalculated.
It is estimated that a user profile changes every 100 ms. This number, however, is based
on very little data, and there is considerable risk that profiles may change more rapidly
and compromise real-time operation. In addition, there is a large amount of overhead that
must be performed before each dot product. In a recent benchmark the overhead
consumed nearly all of the time allocated for the entire C-matrix update. Finally, if the
C-matrix is calculated as described above then an entire processor must be allocated for
this calculation.

In view of the above observations a better approach is to pre-calculate the code
correlations up-front when a user is added to the system. This calculation is performed
over all possible code offsets and the calculations are stored in a large array,
approximately 21 Mbytes in size. We will henceforth refer to this large matrix as the Γ
matrix. The C-matrix elements are updated when a profile changes by extracting the
appropriate elements from the Γ matrix and performing minor calculations. Since the Γ
matrix elements are calculated for all code offsets the FFT can be effectively used to
speed up the calculations. Since all code offsets are pre-calculated, there is no risk
associated with rapidly changing multipath profiles. Under normal operating conditions,
when the number of users accessing the system is constant, the resources which must be
allocated to extracting the C-matrix elements are minimal, and so extra resources may be
allocated to the R-matrix calculation.

Section 2 below outlines the calculation of the Γ matrix elements. It is shown that the Γ
matrix elements are given in terms of a convolution. Section 3 shows how to calculate the
Γ matrix elements using the FFT. Section 4 describes how the Γ-matrix elements might be
accessed from SDRAM. In section 5 various processing times are estimated, and a
summary with conclusions is given in section 6.



2. C-matrix Elements Expressed in Terms of Code Correlations 

The R-matrix elements are given in terms of the C-matrix elements as [1] 



9=1 q'= 

(1) 



where C_lkqq'[m'] is a five-dimensional matrix of code correlations. Both l and k range from 1
to K_v, where K_v is the number of virtual users. The indices q and q' range from 1 to L, the
number of multipath components, which is assumed to be equal to 4. The symbol period
offset m' ranges from -1 to 1. The total number of matrix elements to be calculated is then
3(K_v L)^2 = 3(800)^2 = 1.92M complex elements, or 3.84 MB if the real and imaginary
parts of each element are stored as one byte each.
This number is cut in half, however, due to the symmetries [2]

$$C_{lkqq'}[-m'] = \frac{N_k}{N_l}\, C^*_{klq'q}[m'] \qquad (2)$$

The memory requirement is then 1.92 MB.

li'i 

Referring to Equation (1) it is evident that each element of C_lkqq'[m'] is a complex dot
product between a code vector c_l and a waveform vector s_kqq'. The length of the code
vector is 256. The waveform s_k[t] is referred to as the signature waveform for the kth
virtual user. This waveform is generated by passing the spread code sequence c_k[n]
through a pulse-shaping filter g[t]:

$$s_k[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_k[p] \qquad (3)$$

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the signature waveform s_k[t] includes the
effects of filtering by the matched chip filter. Note that for spreading factors less than 256
some of the chips c_k[p] are zero. The length of the waveform vector is L_g + 255 N_c, where
L_g is the length of the raised-cosine pulse vector g[t] and N_c is the number of samples per
chip. The values for these parameters as currently implemented are L_g = 48 and N_c = 4.
The length of the waveform vector is then 1068, but for the dot product it is accessed at a
stride of N_c = 4, which gives effectively a length of 267.

The raised-cosine pulse vector g[t] is defined to be non-zero from t = -L_g/2 + 1 : L_g/2, with
g[0] = 1. With this definition the waveform s_k[t] is non-zero from t = -L_g/2 + 1 : L_g/2 + 255 N_c.
By combining Equations (1) and (3) the calculation of the C-matrix elements can be
expressed directly in terms of the user code correlations. These correlations can be
calculated up front and stored in SDRAM. The C-matrix elements expressed in terms of
the code correlations Γ_lk[m] are

$$
\begin{aligned}
C_{lkqq'}[m'] &= \frac{1}{2N_l}\sum_n \sum_m g[mN_c + \tau]\, c_k[n-m]\, c_l^*[n] \\
&= \sum_m g[mN_c + \tau]\, \frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m] \qquad (4)
\end{aligned}
$$

Since the pulse shape vector g[n] is of length L_g there are at most 2L_g/N_c = 24 real macs
to be performed to calculate each element C_lkqq'[m']. (The factor of 2 is because the code
correlations Γ_lk[m] are complex.) Given τ it is important to be able to efficiently calculate
the range of values m for which g[mN_c + τ] is non-zero. The minimum value of m is given
by m_min1 N_c + τ = -L_g/2 + 1. Now τ is given by τ = m'N N_c + τ_lq - τ_kq'. If each τ value is
decomposed as τ_lq = n_lq N_c + p_lq, then m_min1 = ceil[(-τ - L_g/2 + 1)/N_c] = -m'N - n_lq + n_kq' -
L_g/(2N_c) + ceil[(p_kq' - p_lq + 1)/N_c]. Now ceil[(p_kq' - p_lq + 1)/N_c] will be either 0 or 1. It is
convenient to set this to 0. In order that we do not access values outside the allocation for
g[n] we must set g[n] = 0.0 for n = -L_g/2 - (N_c - 1) : -L_g/2. Note that of the N_c^2 possible
values for ceil[(p_kq' - p_lq + 1)/N_c], all but one are 0. Hence we have

$$m_{min1} = -m'N - n_{lq} + n_{kq'} - L_g/(2N_c) \qquad (5)$$

Note that L_g must be divisible by 2N_c, and that L_g/(2N_c) should be a system constant.

The maximum value of m is given by m_max1 N_c + τ = L_g/2. This gives m_max1 = floor[(-τ +
L_g/2)/N_c] = -m'N - n_lq + n_kq' + L_g/(2N_c) + floor[(p_kq' - p_lq)/N_c]. Now floor[(p_kq' - p_lq)/N_c] will
be either -1 or 0. It is convenient to set this to 0. In order that we do not access values
outside the allocation for g[n] we must set g[n] = 0.0 for n = L_g/2 + 1 : L_g/2 + N_c. Note that
of the N_c^2 possible values for floor[(p_kq' - p_lq)/N_c], about half are 0. Hence we have

$$m_{max1} = -m'N - n_{lq} + n_{kq'} + L_g/(2N_c) \qquad (6)$$

These values are quickly calculable.
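The relationship between the exact summation limits and the convenient limits of Equations (5) and (6) can be illustrated with integer arithmetic. This is a sketch under the memo's stated parameters (L_g = 48, N_c = 4, N = 256); the function names and the lag values in the example are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -((-a) // b)


def m_range_exact(tau, lg=48, nc=4):
    """Exact m range for which g[m*nc + tau] lies inside -lg/2 + 1 .. lg/2."""
    m_min = ceil_div(-tau - lg // 2 + 1, nc)
    m_max = (-tau + lg // 2) // nc
    return m_min, m_max


def m_range_fast(m_prime, tau_lq, tau_kq, n_chips=256, lg=48, nc=4):
    """Convenient bounds of Equations (5) and (6): the ceil/floor terms are set to 0."""
    n_lq = tau_lq // nc          # tau_lq = n_lq * nc + p_lq
    n_kq = tau_kq // nc          # tau_kq' = n_kq' * nc + p_kq'
    center = -m_prime * n_chips - n_lq + n_kq
    half = lg // (2 * nc)        # the system constant L_g / (2 N_c)
    return center - half, center + half


if __name__ == "__main__":
    m_prime, tau_lq, tau_kq = 1, 37, 14          # illustrative lags in samples
    tau = m_prime * 256 * 4 + tau_lq - tau_kq
    print(m_range_exact(tau), m_range_fast(m_prime, tau_lq, tau_kq))
```

The fast range is never narrower than the exact one and is at most one index wider on each side, which is why the pulse table needs N_c extra zero samples at each end.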
The Γ matrix is calculated in the next section for all values m by exploiting the FFT. Notice
that the calculation of the C-matrix elements requires only a small subset of the Γ matrix
elements.

3. Using the FFT to Calculate the Γ-matrix Elements

In the previous section it was shown that the Γ-matrix elements can be represented as a
convolution. This fact is here exploited to calculate the Γ-matrix elements using the FFT
convolution theorem. From Equation (4) the Γ-matrix elements are

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m] \qquad (7)$$

where N = 256 is the maximum spreading factor. Three streams are related by this equation.
In order to apply the convolution theorem all three streams must be defined over the same
time interval. The code streams c_l*[n] and c_k[n] are non-zero from n = 0:255. These
intervals are based on the maximum spreading factor. For higher data-rate users the
intervals over which the streams are non-zero are reduced further. We are concerned here,
however, with the intervals derived from the highest spreading factor since these will be
the largest intervals and we wish to define a common interval for all streams. The common
interval allows the FFTs to be reused for all user interactions.

Figure 1. Interval for FFT calculation of the Γ matrix elements. Shown for the case where
N_k = 256 and N_l = 128.

The range of values m for which Γ_lk[m] is non-zero can be derived from the above
intervals. The maximum value of m is limited by n - m >= 0, which gives

$$255 - m_{max} = 0 \;\Rightarrow\; m_{max} = 255 \qquad (8)$$

and the minimum value is limited by n - m <= 255, which gives

$$0 - m_{min} = 255 \;\Rightarrow\; m_{min} = -255 \qquad (9)$$

To achieve a common interval for all three streams we select the interval m = -M/2 : M/2 - 1,
M = 512. Where necessary the streams are zero-padded to fill up the interval.
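Now, the DFT and IDFT of the streams are

$$C_k[r] = \sum_{n=-M/2}^{M/2-1} c_k[n]\, e^{-j2\pi nr/M}, \qquad c_k[n] = \frac{1}{M}\sum_{r=-M/2}^{M/2-1} C_k[r]\, e^{j2\pi nr/M} \qquad (10)$$

which gives

$$\Gamma_{lk}[m] = \frac{1}{2N_l M^2}\sum_{n}\sum_{r}\sum_{r'} C_l^*[r]\, C_k[r']\, e^{-j2\pi n(r-r')/M}\, e^{-j2\pi mr'/M} = \frac{1}{2N_l M}\sum_{r=-M/2}^{M/2-1} C_l^*[r]\, C_k[r]\, e^{-j2\pi mr/M} \qquad (11)$$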



Hence Γ_lk[m] can be calculated using the FFTs. Notice that the FFT gives values for all m.
From the analysis above we know that many of these values will be zero for high data rate
users. To conserve memory we wish to store only the non-zero values. The values of m
for which Γ_lk[m] is non-zero can be determined analytically. This subject is treated in the
next section where the storage and retrieval of the Γ-matrix elements is considered.
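The transform-domain calculation of the code correlations can be sketched in Python and compared against the direct sum. To stay self-contained the sketch uses a plain O(M^2) DFT instead of an optimized FFT routine, short random codes instead of 256-chip codes, and illustrative function names; indices run from 0 to M - 1 rather than from -M/2 to M/2 - 1, with negative lags read modulo M.

```python
import cmath
import random


def dft(x, inverse=False):
    """Plain discrete Fourier transform (stand-in for an FFT library call)."""
    m = len(x)
    sign = 1 if inverse else -1
    out = []
    for r in range(m):
        acc = sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * r / m)
                  for n in range(m))
        out.append(acc / m if inverse else acc)
    return out


def gamma_direct(c_l, c_k, m):
    """(1/2N) sum_n c_l*[n] c_k[n - m], codes taken as zero outside 0..N-1."""
    n_chips = len(c_l)
    total = 0j
    for n in range(n_chips):
        if 0 <= n - m < n_chips:
            total += c_l[n].conjugate() * c_k[n - m]
    return total / (2 * n_chips)


def gamma_via_dft(c_l, c_k):
    """All lags at once: zero-pad to M = 2N, multiply spectra, transform back."""
    n_chips = len(c_l)
    size = 2 * n_chips
    pad = [0j] * (size - n_chips)
    spec_l = dft(list(c_l) + pad)
    spec_k = dft(list(c_k) + pad)
    corr = dft([a.conjugate() * b for a, b in zip(spec_l, spec_k)], inverse=True)
    # corr[j] holds sum_n c_l*[n] c_k[n + j]; lag m corresponds to j = -m mod M.
    return {m: corr[(-m) % size] / (2 * n_chips)
            for m in range(-(n_chips - 1), n_chips)}


if __name__ == "__main__":
    rng = random.Random(0)
    c_l = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(8)]
    c_k = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(8)]
    table = gamma_via_dft(c_l, c_k)
    print(max(abs(table[m] - gamma_direct(c_l, c_k, m)) for m in table))
```

Zero-padding to twice the code length is what prevents the circular correlation from wrapping, mirroring the choice of M = 512 for 256-chip codes.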



4. Storage and Retrieval of r-matrix Elements 

In order to efficiently store the Γ-matrix elements we must determine which values are
non-zero. For high data rate users certain elements c_l[n] are zero, even within the interval
n = 0 : N - 1, N = 256. These zero values reduce the interval over which Γ_lk[m] is non-zero.
In order to determine the interval for non-zero values consider

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_{n=0}^{N-1} c_l^*[n]\, c_k[n-m] \qquad (12)$$

Define index j_l for the lth virtual user such that c_l[n] is non-zero only over the interval
n = j_l N_l : j_l N_l + N_l - 1. Correspondingly, the vector c_k[n] is non-zero only over the interval
n = j_k N_k : j_k N_k + N_k - 1. Given these definitions Γ_lk[m] can be rewritten as

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_{n=j_l N_l}^{j_l N_l + N_l - 1} c_l^*[n]\, c_k[n-m] \qquad (13)$$

The minimum value of m for which Γ_lk[m] is non-zero is

$$m_{min2} = j_l N_l - j_k N_k - N_k + 1 \qquad (14)$$

and the maximum value of m for which Γ_lk[m] is non-zero is

$$m_{max2} = j_l N_l - j_k N_k + N_l - 1 \qquad (15)$$

The total number of non-zero elements is then

$$m_{total} = m_{max2} - m_{min2} + 1 = N_l + N_k - 1 \qquad (16)$$



Table 1 below gives the number of bytes per l,k virtual-user pair based on 2 bytes per 
element - one byte for the real part and one byte for the imaginary part. 

Table 1. Number of bytes per l,k virtual user pair based on 2 bytes per element.

             N_k = 256    128     64     32     16      8      4
N_l = 256        1022     766    638    574    542    526    518
      128         766     510    382    318    286    270    262
       64         638     382    254    190    158    142    134
       32         574     318    190    126     94     78     70
       16         542     286    158     94     62     46     38
        8         526     270    142     78     46     30     22
        4         518     262    134     70     38     22     14
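Every entry of Table 1 is 2(N_l + N_k - 1), i.e. two bytes for each of the m_total non-zero correlation values. A short Python sketch that regenerates the table (the names are illustrative):

```python
SPREADING_FACTORS = [256, 128, 64, 32, 16, 8, 4]


def bytes_per_pair(n_l, n_k, bytes_per_element=2):
    """Storage for one (l, k) pair: 2 bytes for each of N_l + N_k - 1 lags."""
    return bytes_per_element * (n_l + n_k - 1)


BYTES_TABLE = [[bytes_per_pair(n_l, n_k) for n_k in SPREADING_FACTORS]
               for n_l in SPREADING_FACTORS]

if __name__ == "__main__":
    for n_l, row in zip(SPREADING_FACTORS, BYTES_TABLE):
        print(f"{n_l:4d}", " ".join(f"{v:5d}" for v in row))
```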



Now we are in a position to determine the memory requirements for the Γ matrix for a
given number of users at each spreading factor. Let there be K_q virtual users at spreading
factor N_q = 2^(8-q), q = 0:6, where K_q is the qth element of the vector K. Note that some
elements of K may be zero. Let Table 1 above be stored in matrix M with elements M_qq'.
For example, M_00 = 1022, and M_01 = 766. The total memory required by the Γ matrix in
bytes is then

$$M_{bytes} = \sum_{q=0}^{6} \frac{K_q (K_q + 1)}{2}\, M_{qq} + \sum_{q=0}^{6}\sum_{q'>q} K_q K_{q'}\, M_{qq'} \qquad (17)$$

For example, for 200 virtual users at spreading factor N_0 = 256 we have K_q = 200 δ_q0,
which gives M_bytes = ½K_0(K_0 + 1)M_00 = 100(201)(1022) = 20.5 MB.

For 10 384 Kbps users we have K_q = K_0 δ_q0 + K_6 δ_q6 with K_0 = 10 and K_6 = 640. This gives
M_bytes = ½K_0(K_0 + 1)M_00 + K_0 K_6 M_06 + ½K_6(K_6 + 1)M_66 = 5(11)(1022) + 10(640)(518) +
320(641)(14) = 6.2 MB.
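The two memory examples can be reproduced with a direct implementation of the sum over spreading-factor pairs. The sketch below is illustrative (the function name is not from the memo) and recomputes the per-pair byte counts as 2(N_q + N_q' - 1) rather than reading them from a stored table.

```python
SPREADING_FACTORS = [256, 128, 64, 32, 16, 8, 4]   # N_q for q = 0..6


def gamma_matrix_bytes(users_per_sf):
    """Total bytes for the stored correlations, given K_q for q = 0..6."""
    total = 0
    for q, k_q in enumerate(users_per_sf):
        n_q = SPREADING_FACTORS[q]
        # Pairs within the same spreading factor, including l == k.
        total += k_q * (k_q + 1) // 2 * 2 * (2 * n_q - 1)
        # Pairs across different spreading factors, each counted once.
        for q2 in range(q + 1, len(users_per_sf)):
            n_q2 = SPREADING_FACTORS[q2]
            total += k_q * users_per_sf[q2] * 2 * (n_q + n_q2 - 1)
    return total


if __name__ == "__main__":
    print(gamma_matrix_bytes([200, 0, 0, 0, 0, 0, 0]) / 1e6)   # about 20.5 MB
    print(gamma_matrix_bytes([10, 0, 0, 0, 0, 0, 640]) / 1e6)  # about 6.2 MB
```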

Now consider addressing, storing and accessing the Γ-matrix data. For each pair (l,k),
k >= l, we have 1 complex value Γ_lk[m] for each value of m, where m ranges from m_min2
to m_max2, and the total number of non-zero elements is m_total = m_max2 - m_min2 + 1. Hence for
each pair (l,k), k >= l, we have 2 m_total time-contiguous bytes. To access the data, create an
array of structures:

    struct {
        int m_min2;
        int m_max2;
        int m_total;
        char *Glk;
    } G_info[N_VU_MAX][N_VU_MAX];

The C-matrix data is then retrieved using something like:

    m_min2 = G_info[l][k].m_min2
    m_max2 = G_info[l][k].m_max2
    Ng = Lg/Nc
    for m' = 0:1
        N1 = -m'*N - Lg/(2*Nc)                  /* sign as in Equation (5) */
        for q = 0:L-1
            for q' = 0:L-1
                m_min1 = N1 - n_lq + n_kq'
                m_max1 = m_min1 + Ng
                m_min = max[ m_min1, m_min2 ]
                m_max = min[ m_max1, m_max2 ]
                sum1 = 0.0
                if m_max >= m_min
                    m_span = m_max - m_min + 1
                    /* Glk holds the values for m = m_min2 : m_max2 */
                    ptr1 = &G_info[l][k].Glk[m_min - m_min2]
                    ptr2 = &g[m_min*Nc + tau]
                    while m_span > 0
                        sum1 += ( *ptr1++ ) * ( *ptr2 )
                        ptr2 += Nc              /* g is sampled at a stride of Nc */
                        m_span--
                    end
                end
                C[m'][l][k][q][q'] = sum1
            end
        end
    end
5. Estimated Processing Times

The following processing times are estimated below:

• Calculate Γ-matrix elements

• Write Γ-matrix elements to SDRAM

• Pack Γ-matrix elements in SDRAM

• Extract Γ-matrix elements/Form C-matrix from SDRAM

• Write C-matrix elements to L2 cache

• Pack C-matrix elements in L2 cache



Processing times are calculated for two cases of interest. The first case is where K = 100
users (K_v = 200 virtual users) are accessing the system and a voice user is added to the
system. Not all of these users are active. The control channels are always active, but the
data channels have activity factor AF = 0.4. The mean number of active virtual users is
then K + AF*K = 140. The standard deviation is σ = sqrt(K * AF * (1 - AF)) = 4.90. With high
probability, then, we have K_v < 140 + 3σ < 155 active virtual users.

The second case is the worst case scenario. This occurs when a number of voice users
are accessing the system and a single 384 Kbps data user is added. A single 384 Kbps
data user adds interference equal to (0.25 + 0.125*100)/(0.25 + 0.400*1) ≈ 20 voice users.
Hence, the number of voice users accessing the system must be reduced to
approximately K = 100 - 20 = 80 (K_v = 160). The 3σ number of active virtual users is then
80 + (0.125)80 + 3(3.0) = 99 active virtual users. The reason this scenario is stressful is
that when a single 384 Kbps data user is added to the system, J + 1 = 64 + 1 = 65 virtual
users are added to the system.
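The active-user estimates for the two scenarios are a binomial mean and standard deviation over the data channels, with the control channels always on. A small Python sketch of that calculation follows; the function name is illustrative, and the activity factors are the ones used in the text.

```python
import math


def active_virtual_users(num_users, activity_factor, sigmas=3.0):
    """Mean, standard deviation and mean + sigmas * std of active virtual users.

    Each user contributes one always-active control channel plus one data
    channel that is active with probability activity_factor.
    """
    mean = num_users + activity_factor * num_users
    std = math.sqrt(num_users * activity_factor * (1.0 - activity_factor))
    return mean, std, mean + sigmas * std


if __name__ == "__main__":
    print(active_virtual_users(100, 0.4))     # first scenario
    print(active_virtual_users(80, 0.125))    # second scenario
```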

Calculate Γ-matrix elements

The Γ-matrix elements can be calculated in one of two ways. The first is using the SAL
zconvx to perform the direct convolution. The second is using the SAL fft_zipx to perform
the calculation via the FFT. The first method is preferable when the vector lengths are
small. SAL timings are given in Table 2. These timings are based on a 400 MHz PPC7400
with 160 MHz, 2 MB L2 cache. The data is assumed resident in L1 cache. The
performance loss for data L2 cache resident is not severe.



Table 2. SAL timings and GFLOPS for zconvx function

M       N      Timing (μs)    GFLOPS
1024    4      19.33          1.70
1024    8      29.73          2.20
1024    16     50.55          2.59
1024    32     92.32          2.84
1024    64     176.53         2.97
1024    128    346.80         3.47



The time to perform a 512 complex FFT, with in-place calculation (fft_zipx), on a 400 MHz
PPC7400 with 160 MHz, 2 MB L2 cache is 10.94 μs for data L1 resident. Prior to
performing the (final) FFT we must perform a complex vector multiply of length 512. The
SAL timings for zvmulx are given in Table 3.



Table 3. SAL timings and GFLOPS for zvmulx function

N       Location    Timing (μs)    GFLOPS
1024    L1          4.46           1.38
1024    L2          24.27          0.253
1024    DRAM        61.49          0.100



We will also be interested in the time to move data. Hence the SAL timings for zvmovx are
given in Table 4.

Table 4. SAL timings for zvmovx function

N       Location    Timing (μs)
1024    L1          1.20
1024    L2          15.34
1024    DRAM        30.05



Figure 2 shows the elements that must be calculated (in gray) when a physical user is
added to the system. When a physical user is added to the system there are 1 + J virtual
users added to the system: that is, 1 control channel + J = 256/SF data channels. The
number K_v represents the number of virtual users that are using the system to begin with.

Figure 2. Elements that must be calculated (in gray) when a physical user is added to the
system. [Rows l and columns k each cover the K_v existing virtual users followed by the
1 control channel and the J data channels.]

Hence there are (K_v + 1) elements added due to the control channel, and J(K_v + 1) + J(J +
1)/2 elements added due to the data channels. The total number of elements added is
then (J + 1)[K_v + 1 + J/2].
Suppose that the FFT is used to perform the calculations. The total number of FFTs to
perform is (J + 1) + (J + 1)[K_v + 1 + J/2]. The first term represents the FFTs to transform
c_k[n], and the second term represents the (J + 1)[K_v + 1 + J/2] inverse FFTs of
FFT{c_k[n]}*FFT{c_l[n]}. The time to perform the complex 512 FFT is 10.94 μs, whereas
the time to perform the complex vector multiply and the complex 512 FFT is 24.27/2 +
10.94 = 23.08 μs.

For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). The total time to add the voice user is then (1 +
1)(10.94 μs) + (1 + 1)[200 + 1 + 1/2](23.08 μs) = 9.3 ms.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). The total time to add the 384 Kbps user is
then (64 + 1)(10.94 μs) + (64 + 1)[160 + 1 + 64/2](23.08 μs) = 290 ms! This number is
way too big and hence for high data-rate users, at least, the Γ-matrix elements must be
calculated via convolutions.
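The two FFT-based totals can be reproduced from the counts and timings quoted above. This Python sketch uses illustrative names; the 10.94 μs FFT time and the 24.27 μs (length-1024, L2-resident) vector-multiply time are the benchmark figures from the text.

```python
FFT_512_US = 10.94            # in-place complex 512-point FFT
ZVMUL_1024_L2_US = 24.27      # complex vector multiply, length 1024, L2 resident


def fft_time_to_add_user_us(kv, num_data_channels):
    """Time to compute the new correlations with FFTs when one user is added."""
    j = num_data_channels
    forward_ffts = j + 1                              # one per new virtual user
    pairs = (j + 1) * (kv + 1 + j / 2)                # new (l, k) pairs
    per_pair_us = ZVMUL_1024_L2_US / 2 + FFT_512_US   # multiply (length 512) + inverse FFT
    return forward_ffts * FFT_512_US + pairs * per_pair_us


if __name__ == "__main__":
    print(fft_time_to_add_user_us(200, 1) / 1000.0)    # voice user, in ms
    print(fft_time_to_add_user_us(160, 64) / 1000.0)   # 384 Kbps user, in ms
```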



The direct method to calculate the Γ-matrix elements is to use the SAL zconvx function to
perform the convolution

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_{n=0}^{N-1} c_l^*[n]\, c_k[n-m] \qquad (18)$$

For each value of m there are N_min = min{N_l, N_k} complex macs (cmacs). Each cmac
requires 8 flops, and there are m_total = N_l + N_k - 1 m-values to calculate. Hence the total
number of flops is 8 N_min (N_l + N_k - 1). For what follows we assume the convolution
calculation is performed at 1.50 GOPS = 1500 ops/μs. The calculation time to perform the
convolutions is presented in Table 5.

Table 5. Calculation time (μs) to perform the Γ-matrix convolutions

             N_k = 256     128      64      32      16       8       4
N_l = 256      697.69   261.46  108.89   48.98   23.13   11.22    5.53
      128      261.46   174.08   65.19   27.14   12.20    5.76    2.79
       64      108.89    65.19   43.35   16.21    6.74    3.03    1.43
       32       48.98    27.14   16.21   10.75    4.01    1.66    0.75
       16       23.13    12.20    6.74    4.01    2.65    0.98    0.41
        8       11.22     5.76    3.03    1.66    0.98    0.64    0.23
        4        5.53     2.79    1.43    0.75    0.41    0.23    0.15
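Table 5 follows from the flop count 8 N_min (N_l + N_k - 1) and the assumed 1.50 GOPS rate. A Python sketch that regenerates the entries and lists which ones beat the 23.08 μs FFT-based time is given below; the names are illustrative.

```python
SPREADING_FACTORS = [256, 128, 64, 32, 16, 8, 4]
FFT_PAIR_US = 23.08   # vector multiply + inverse FFT time per (l, k) pair


def convolution_time_us(n_l, n_k, gops=1.5):
    """Time for the direct convolution of one (l, k) pair, in microseconds."""
    flops = 8 * min(n_l, n_k) * (n_l + n_k - 1)
    return flops / (gops * 1000.0)    # 1.5 GOPS = 1500 ops per microsecond


def faster_than_fft():
    """Pairs of spreading factors where direct convolution beats the FFT."""
    return [(n_l, n_k)
            for n_l in SPREADING_FACTORS for n_k in SPREADING_FACTORS
            if convolution_time_us(n_l, n_k) < FFT_PAIR_US]


if __name__ == "__main__":
    for n_l in SPREADING_FACTORS:
        print(n_l, [round(convolution_time_us(n_l, n_k), 2)
                    for n_k in SPREADING_FACTORS])
```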



The shaded cells indicate times faster than the 23.08 μs FFT time. Equation 17 gives the
size of the Γ-matrix in bytes. Similarly, the total time to calculate the Γ-matrix is

$$T_\Gamma = \sum_{q=0}^{6} \frac{K_q (K_q + 1)}{2}\, T_{qq} + \sum_{q=0}^{6}\sum_{q'>q} K_q K_{q'}\, T_{qq'} = \tfrac{1}{2}\left[ K \cdot \mathrm{diag}(T) + K^T \cdot T \cdot K \right] \qquad (19)$$

where T_qq' are the elements in Table 5. Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy,
where x and y are not equal. Then

$$\Delta T_\Gamma = \tfrac{1}{2} J_x (J_x + 1)\, T_{xx} + \tfrac{1}{2} J_y (J_y + 1)\, T_{yy} + J_x J_y T_{xy} + \sum_{q=0}^{6} K_q \left( J_x T_{qx} + J_y T_{qy} \right) \qquad (20)$$

For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). Hence we have K_q = K_v δ_q0 (SF = 256), K_v = 200,
J_x = 2 (data plus control) and J_y = 0. The total time is then

½J_x(J_x + 1)T_00 + J_x K_v T_00 = (0.5)(2)(3)(0.70 ms) + (2)(200)(0.70 ms) ≈ 282 ms

This number is way too big and hence for voice users, at least, the Γ-matrix elements
must be calculated via FFTs.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). Hence we have K_q = K_v δ_q0 (SF = 256), K_v
= 160, J_x = 1 (control) and J_y = J = 64 (data). The total time is then

(K_v + 1)T_00 + J(K_v + 1)T_06 + (J + 1)(J/2)T_66
= (161)(697.7 μs) + (64)(161)(5.53 μs) + (65)(32)(0.15 μs)
= 112.33 ms + 56.98 ms + 0.31 ms = 169.62 ms

Since T_00 = 697.7 μs is so large, these calculations should be performed using the FFT,
which costs 23.08 μs per convolution. We also have 1 FFT to compute FFT{c_k[n]} for
the single control channel. This costs an additional 10.94 μs. The total time, then, to add
the 384 Kbps user is

10.94 μs + (161)(23.08 μs) + (64)(161)(5.53 μs) + (65)(32)(0.15 μs)
= 61.02 ms
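The second-scenario totals can be cross-checked by combining the convolution times with the FFT cost for the control-channel pairs. The Python sketch below is illustrative: the names are hypothetical, and it recomputes the convolution times from the flop model instead of using the rounded table entries, so its results differ from the text in the second decimal place.

```python
def convolution_time_us(n_l, n_k, gops=1.5):
    """Direct convolution time for one (l, k) pair, in microseconds."""
    return 8 * min(n_l, n_k) * (n_l + n_k - 1) / (gops * 1000.0)


def time_to_add_data_user_ms(kv, j, data_sf=4, control_via_fft=False,
                             fft_pair_us=23.08, fft_us=10.94):
    """Time to add one control channel (SF 256) plus j data channels (SF data_sf)
    to a system holding kv virtual users at SF 256."""
    t_cc = convolution_time_us(256, 256)          # control vs. SF-256 users
    t_cd = convolution_time_us(256, data_sf)      # data vs. SF-256 users
    t_dd = convolution_time_us(data_sf, data_sf)  # data vs. data
    if control_via_fft:
        control_us = fft_us + (kv + 1) * fft_pair_us
    else:
        control_us = (kv + 1) * t_cc
    cross_us = j * (kv + 1) * t_cd
    data_us = (j + 1) * (j / 2) * t_dd
    return (control_us + cross_us + data_us) / 1000.0


if __name__ == "__main__":
    print(time_to_add_data_user_ms(160, 64))                        # all convolutions
    print(time_to_add_data_user_ms(160, 64, control_via_fft=True))  # hybrid
```

The hybrid choice replaces only the (K_v + 1) large SF-256 convolutions with FFTs and keeps direct convolution for the short high-rate codes.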
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Write to r-matrix elements to SDRAM 

The numbers in Table 1 represent the 2/7?^^/ bytes per r-matrix element. Recall that the 
size of the r-matrix in bytes from Equation 17 is 

=^ik^,,+i^,^,-^«| (21) 

= I [«■ • diag(M) + K''Mk] 
Now suppose K'= K+A, where Aq = Jxdqx + JySqy, where xand y are not equal. Then 

C| =ij,(J, +-J^(J^ + l)M^+JJ^M^ (22) 

W q=0 

Consider the first scenario, where K_q = 200 δ_q0 (SF = 256) and a single voice user is
added to the system: J_x = 2 (data plus control) and J_y = 0. The total number of bytes is
then 0.5(2)(3)(1022) + 200(2)(1022) = 0.412 MB. The SDRAM write speed is 133 MHz * 8
bytes * 0.5 = 532 MB/s. The time to write to SDRAM is then 0.774 ms.

Now for the second scenario, K_q = 160 δ_q0 (SF = 256) and a single 384 Kbps (SF = 4)
user is added to the system: J_x = 1 (control) and J_y = 64 (data). The total number of bytes
is then 0.5(1)(2)(1022) + 0.5(64)(65)(14) + 160{1(1022) + 64(518)} = 5.498 MB. The
SDRAM write speed is 133 MHz * 8 bytes * 0.5 = 532 MB/s. The time to write to SDRAM is
then 10.33 ms.
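
As a cross-check on the two write times, the following C sketch converts a byte count into a transfer time at the effective SDRAM rate. The helper names are assumptions made for illustration; the clock, bus width and 50% efficiency are the values used above.

```c
#include <assert.h>

/* Effective SDRAM bandwidth in bytes per microsecond (numerically MB/s):
 * clock (MHz) x bus width (bytes) x measured efficiency. */
double sdram_bytes_per_us(double clock_mhz, double bus_bytes, double efficiency)
{
    assert(clock_mhz > 0.0 && bus_bytes > 0.0 && efficiency > 0.0);
    return clock_mhz * bus_bytes * efficiency;
}

/* Time in milliseconds to move `bytes` at the given rate. */
double sdram_time_ms(double bytes, double bytes_per_us)
{
    assert(bytes >= 0.0 && bytes_per_us > 0.0);
    return bytes / bytes_per_us / 1000.0;
}
```

With `sdram_bytes_per_us(133.0, 8.0, 0.5)` as the rate, the byte totals of the two scenarios give the 0.774 ms and 10.33 ms figures quoted above.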



Pack Γ-matrix elements in SDRAM

The maximum total size of the Γ-matrix is 20.5 MB. Suppose that in order to pack the
matrix every element must be moved. This is the worst case. The SDRAM speed is
133 MHz * 8 bytes * 0.5 = 532 MB/s. The move time is then 2(20.5 MB)/(532 MB/s) = 77.1
ms. If the Γ-matrix is divided over three processors this time is reduced by a factor of 3.
The packing can be done incrementally, so there is no strict time limit.
Extract Γ-matrix elements/Form C-matrix from SDRAM

Recall that the C-matrix data is retrieved using something like:

    m_min2 = G_info[l][k].m_min2
    m_max2 = G_info[l][k].m_max2
    Ng = Lg/Nc
    N1 = m'*N - Lg/(2Nc)
    for m' = 0:1
        for q = 0:L-1
            for q' = 0:L-1
                m_min1 = N1 - τ_lq + τ_kq'
                m_max1 = m_min1 + Ng
                m_min = max[ m_min1, m_min2 ]
                m_max = min[ m_max1, m_max2 ]
                if m_max >= m_min
                    m_span = m_max - m_min + 1
                    sum1 = 0.0;
                    ptr1 = &G_info[l][k].Glk[m_min]
                    ptr2 = &g[m_min*Nc + τ]
                    while m_span > 0
                        sum1 += ( *ptr1++ ) * ( *ptr2++ )
                        m_span--
                    end
                    C[m'][l][k][q][q'] = sum1
                end
            end
        end
    end



Time to extract elements when a new user is added to the system 

We calculated above the time to calculate the Γ-matrix elements when a new user is
added to the system. Here we consider the time to extract the corresponding C-matrix
elements.

Notice that the values Glk[m] are accessed from SDRAM. Values will almost certainly not be in
either L1 or L2 cache. For a given (l, k) pair, however, the spread in τ will in most cases be
less than 8 µs (i.e., for a 4 µs delay spread), which equates to (8 µs)(4 chips/µs)(2 bytes/chip) =
64 bytes, or 2 cache lines. Since data must be read in for two values of m', a total of 4
cache lines must be read. This will require 16 clocks, or about 16/133 = 0.12 µs.
However, measured results for zvmovx indicate that accesses to SDRAM are performed
at about 50% efficiency, so that the required time is about 0.24 µs.

Now suppose, for example, user l = x is added to the system. We must fetch the elements
C[m'][x][k][q][q'] for all m', k, q and q'. As indicated above, all the m', q and q' values will
typically be contained in 4 cache lines. Hence if there are K_v virtual users we must read in
4K_v cache lines, or 32K_v clocks, where we have doubled the clocks to account for the 50%
efficiency. In general J + 1 virtual users are added to the system at a time. This will
require 32K_v(J + 1) clocks.

For the first case, where we have 155 active virtual users and a new voice user is added
to the system, the time required to read in the C-matrix elements will be 32(155)(1 + 1)
clocks/(133 clocks/µs) = 74.6 µs. The industry standard hold time t_h for a voice call is 140
s. The average rate λ of users added to the system can be determined from λ t_h = K,
where K is the average number of users using the system. For K = 100 users we have λ =
100/140 s, or 1 user added per 1.4 s.

For the case where we have 99 active virtual users and a 384 Kbps user is added to the
system, the time required to read in the C-matrix elements will be 32(99)(64 + 1)
clocks/(133 clocks/µs) = 1.55 ms. However, data users presumably will be added to the
system less frequently than voice users.



Time to extract elements when τ changes

Now suppose, for example, user l = x lag q = y changes. Then we must fetch the elements
C[m'][x][k][y][q'] for all m', k and q'. All the q' values will typically be contained in 1 cache
line. Hence we must read in 2(K_v)(1) = 2K_v cache lines, or 16K_v clocks, where we have
doubled the clocks to account for the 50% efficiency. In general, when a lag changes
there are J + 1 virtual users for which the C-matrix elements must be updated. This will
require 16K_v(J + 1) clocks.

For the first case, where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to read in the C-matrix elements will be 16(155)(1 + 1)
clocks/(133 clocks/µs) = 37.3 µs. Recall that for high mobility users such changes should
occur at a rate of about 1 per 100 ms per physical user. This equates to about once per
1.33 ms processing interval if there are 100 physical users, so that approximately 37.3 µs
will be required every 1.33 ms.

For the case where we have 99 virtual users and a 384 Kbps data user's profile (one lag)
changes, the time required to read in the C-matrix elements will be 16(99)(64 + 1)
clocks/(133 clocks/µs) = 0.774 ms. However, data users will have lower mobility and hence
such changes should occur infrequently.
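
The four read times above follow one pattern, sketched here in C. The function name and its constants are illustrative assumptions drawn from the figures in the text (4 clocks per cache line, 50% access efficiency, 133 clocks/µs); it is not code from the original system.

```c
#include <assert.h>

/* Hypothetical helper (illustrative names).
 *   kv             : active virtual users (K_v)
 *   j              : data channels of the affected user, so J + 1 virtual
 *                    users are touched
 *   lines_per_pair : cache lines fetched per (l, k) pair; the text uses 4
 *                    when a user is added and 2 when a single lag changes
 * Returns the SDRAM read time in microseconds.
 */
double cmatrix_read_time_us(int kv, int j, int lines_per_pair)
{
    const double clocks_per_line = 4.0;   /* 16 clocks for 4 cache lines   */
    const double efficiency      = 0.5;   /* measured SDRAM efficiency     */
    const double clocks_per_us   = 133.0; /* 133 MHz memory clock          */

    assert(kv >= 0 && j >= 0 && lines_per_pair > 0);
    double clocks = lines_per_pair * clocks_per_line / efficiency
                    * kv * (j + 1);
    return clocks / clocks_per_us;
}
```

For example, `cmatrix_read_time_us(155, 1, 4)` corresponds to the voice-user case and `cmatrix_read_time_us(99, 64, 2)` to a lag change for the 384 Kbps user.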



Write C-matrix elements to L2 cache

Time to write elements when a new user is added to the system

Consider again the case where user l = x is added to the system. We must write elements
C[m'][x][k][q][q'] for all m', k, q and q'. If there are K_v active virtual users we must write
4K_vL^2 bytes, where we have doubled the bytes since the elements are complex. In general
J + 1 virtual users are added to the system at a time. This will require 4K_vL^2(J + 1) bytes to
be written to L2 cache.

For the first case, where we have 155 active virtual users and a new voice user is added
to the system, the time required to write the C-matrix elements will be 4(155)(16)(1 + 1)
bytes/(2128 bytes/µs) = 9.3 µs.

For the second case, where we have 99 active virtual users and a 384 Kbps user is added
to the system, the time required to write the C-matrix elements will be 4(99)(16)(64 + 1)
bytes/(2128 bytes/µs) = 193.5 µs. Recall, however, that data users presumably will be
added to the system less frequently than voice users.



Time to write elements when τ changes

Now suppose, for example, user l = x lag q = y changes. We must write elements
C[m'][x][k][y][q'] for all m', k and q'. If there are K_v active virtual users we must write 4K_vL
bytes, where we have doubled the bytes since the elements are complex. In general, when a
lag changes there are J + 1 virtual users for which the C-matrix elements must be updated.
This will require 4K_vL(J + 1) bytes to be written to L2 cache.

For the first case, where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to write the C-matrix elements will be 4(155)(4)(1 + 1)
bytes/(2128 bytes/µs) = 2.33 µs.

For the second case, where we have 99 active virtual users and a 384 Kbps data user's
profile (one lag) changes, the time required to write the C-matrix elements will be
4(99)(4)(64 + 1) bytes/(2128 bytes/µs) = 48.4 µs. However, data users will have lower
mobility and hence such changes should occur infrequently.
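
The L2 write times can be reproduced the same way. This is a minimal sketch under the assumptions stated in the text (a factor of 4 bytes per element position and an L2 rate of 2128 bytes/µs); the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (illustrative names).
 *   kv             : active virtual users (K_v)
 *   j              : data channels of the affected user (J)
 *   elems_per_pair : L*L when a user is added, L when one lag changes,
 *                    with L = 4 fingers in the examples above
 * Returns the L2 write time in microseconds.
 */
double l2_write_time_us(int kv, int j, int elems_per_pair)
{
    const double bytes_factor    = 4.0;    /* the factor 4 used in the text */
    const double l2_bytes_per_us = 2128.0; /* L2 rate quoted in the text    */

    assert(kv >= 0 && j >= 0 && elems_per_pair > 0);
    double bytes = bytes_factor * kv * elems_per_pair * (j + 1);
    return bytes / l2_bytes_per_us;
}
```

Passing 16 for `elems_per_pair` gives the new-user cases, and passing 4 gives the lag-change cases.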






Pack C-matrix elements in L2 cache 



The C-matrix elements will need to be packed in memory every time a new user is added
to or deleted from the system and every time a user becomes active or inactive. The
size of the C-matrix is 2(3/2)(K_vL)^2 = 3(K_vL)^2 bytes; however, divided over three
processors this becomes (K_vL)^2 bytes per processor. Assume that the entire matrix must
be moved. The move is within L2 cache. Hence the total move time is 2(K_vL)^2 bytes/(2128
bytes/µs), where the factor of 2 accounts for read and write.

For the first case, where we have 155 active virtual users, the time required to move the
C-matrix elements will be 2(155*4)^2 bytes/(2128 bytes/µs) = 0.361 ms.

For the second case, where we have 99 active virtual users, the time required to move the
C-matrix elements will be 2(99*4)^2 bytes/(2128 bytes/µs) = 0.147 ms.

These events will occur typically once every 10 ms, that is, once per frame.
6. Summary and Conclusions

In summary, we have determined

• The Γ-matrix will require approximately 20.5 MB of SDRAM

• To efficiently calculate the Γ-matrix elements will require both direct convolution and
FFT calculations

• To pack the Γ-matrix in SDRAM will require approximately 77.1 ms

The following processing times are estimated:

Estimated Processing Times

                                        Case 1                Case 2
                                        (voice user added)    (384 Kbps user added)

Calculate Γ-matrix elements             9.3 ms                61.0 ms
Write Γ-matrix elements to SDRAM        0.77 ms               10.3 ms
Extract C-matrix elements when
    New user added                      75 µs                 1.6 ms
    Multipath profile changes           37 µs                 0.77 ms
Write C-matrix elements to L2 when
    New user added                      9.3 µs                194 µs
    Multipath profile changes           2.3 µs                48 µs
Pack C-matrix elements in L2 cache      361 µs                147 µs

These times are based on a single, dedicated G4 allocated to perform the calculations.
The C-matrix elements can be represented in terms of the underlying code
correlations using

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n \sum_p g[(n-p)N_c + m'T + \tau_{lq} - \tau_{kq'}]\; c_l[n]\, c_k^*[p] = \sum_m g[mN_c + \tau]\;\Gamma_{lk}[m]$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l}\sum_n c_l[n]\, c_k^*[n-m] \qquad (1)$$

The Γ-matrix represents the correlation between the complex user codes. The
complex code for user l is assumed to be infinite in length, but with only $N_l$ non-
zero values. The non-zero values are constrained to be ±1 ± j. Represented in
terms of the real and imaginary parts of the complex user codes, the Γ-matrix
becomes

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_n \big\{ c_l^R[n]\, c_k^R[n-m] + c_l^I[n]\, c_k^I[n-m] + j\,c_l^I[n]\, c_k^R[n-m] - j\,c_l^R[n]\, c_k^I[n-m] \big\}$$

$$= r_{lk}^{RR}[m] + r_{lk}^{II}[m] + j\big\{ r_{lk}^{IR}[m] - r_{lk}^{RI}[m] \big\} \qquad (2)$$

where

$$r_{lk}^{RR}[m] \equiv \frac{1}{2N_l}\sum_n c_l^R[n]\, c_k^R[n-m], \qquad r_{lk}^{II}[m] \equiv \frac{1}{2N_l}\sum_n c_l^I[n]\, c_k^I[n-m]$$

$$r_{lk}^{IR}[m] \equiv \frac{1}{2N_l}\sum_n c_l^I[n]\, c_k^R[n-m], \qquad r_{lk}^{RI}[m] \equiv \frac{1}{2N_l}\sum_n c_l^R[n]\, c_k^I[n-m] \qquad (3)$$

Consider any one of the above real correlations, denoted

$$r_{lk}^{XY}[m] \equiv \frac{1}{2N_l}\sum_n c_l^X[n]\, c_k^Y[n-m] \qquad (4)$$

where X and Y can be either R or I. Since the elements of the codes are now
constrained to be ±1 or 0, we can define

$$c_l^X[n] = (1 - 2\gamma_l^X[n]) \cdot m_l^X[n] \qquad (5)$$

where $\gamma_l^X[n]$ and $m_l^X[n]$ are both either zero or one. The sequence $m_l^X[n]$ is a
mask used to account for values of $c_l^X[n]$ that are zero. With these definitions
Equation (4) becomes
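$$r_{lk}^{XY}[m] = \frac{1}{2N_l}\sum_n (1 - 2\gamma_l^X[n]) \cdot m_l^X[n] \cdot (1 - 2\gamma_k^Y[n-m]) \cdot m_k^Y[n-m]$$

$$= \frac{1}{2N_l}\sum_n (1 - 2\gamma_l^X[n]) \cdot (1 - 2\gamma_k^Y[n-m]) \cdot m_l^X[n] \cdot m_k^Y[n-m]$$

$$= \frac{1}{2N_l}\sum_n \big[1 - 2(\gamma_l^X[n] \oplus \gamma_k^Y[n-m])\big] \cdot m_l^X[n] \cdot m_k^Y[n-m]$$

$$= \frac{1}{2N_l}\Big\{ \sum_n m_l^X[n] \cdot m_k^Y[n-m] - 2\sum_n (\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \cdot m_l^X[n] \cdot m_k^Y[n-m] \Big\}$$

$$= \frac{1}{2N_l}\big\{ M_{lk}^{XY}[m] - 2N_{lk}^{XY}[m] \big\} \qquad (6)$$

with

$$M_{lk}^{XY}[m] \equiv \sum_n m_l^X[n]\, m_k^Y[n-m], \qquad N_{lk}^{XY}[m] \equiv \sum_n (\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \cdot m_l^X[n]\, m_k^Y[n-m]$$

where ⊕ indicates modulo-2 addition (or logical XOR).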
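
The equivalence between the direct correlation of Equation (4) and the mask/XOR form of Equation (6) can be illustrated in software. The C sketch below omits the common 1/(2N_l) factor and uses hypothetical function names; it is an illustration of the identity, not the hardware design shown in the figures.

```c
#include <assert.h>
#include <stddef.h>

/* Split code values in {-1, 0, +1} into sign bits and masks, per Equation (5):
 * c = (1 - 2*gamma) * mask, so gamma = 1 marks a value of -1. */
void encode_code(const int *c, unsigned char *gamma, unsigned char *mask,
                 size_t len)
{
    for (size_t n = 0; n < len; n++) {
        assert(c[n] >= -1 && c[n] <= 1);
        mask[n]  = (unsigned char)(c[n] != 0);
        gamma[n] = (unsigned char)(c[n] < 0);
    }
}

/* Direct form: sum over n of cl[n] * ck[n - m]; samples outside the
 * arrays are treated as zero. */
int corr_direct(const int *cl, const int *ck, size_t len, long m)
{
    int sum = 0;
    for (size_t n = 0; n < len; n++) {
        long p = (long)n - m;
        if (p >= 0 && (size_t)p < len)
            sum += cl[n] * ck[p];
    }
    return sum;
}

/* Mask/XOR form: M counts positions where both masks are set, N counts
 * those where the sign bits also differ; the result is M - 2N. */
int corr_xor(const unsigned char *gl, const unsigned char *ml,
             const unsigned char *gk, const unsigned char *mk,
             size_t len, long m)
{
    int big_m = 0, big_n = 0;
    for (size_t n = 0; n < len; n++) {
        long p = (long)n - m;
        if (p < 0 || (size_t)p >= len)
            continue;
        int both = ml[n] & mk[p];
        big_m += both;
        big_n += (gl[n] ^ gk[p]) & both;
    }
    return big_m - 2 * big_n;
}
```

Both functions return the same value for every shift m, which is why only counting and XOR operations are needed to form the correlations.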

The hardware to perform these operations is shown in Figures 1-3. Figure 1 
shows the initial register configuration after loading code and mask sequences. 
The boolean functions are shown in Figure 2, and Figure 3 shows the register 
configuration after a number of shifts. 



[Figure 1. Initial register configuration after loading code and mask sequences.
Labels: "Load mask & code for user l here (256 chips)"; "Initialize with all zeros";
"Boolean operations"; "Sum"; "Output".]

[Figure 2. Boolean functions. Labels: "Mask"; "Code k".]

[Figure 3. Register configuration after a number of shifts. Labels: "Load zeros in
from left"; "Perform a total of 512 shifts, shifting mask k and code k out of
registers at right"; "Sum"; "Output".]

The above hardware calculates the functions $M_{lk}^{XY}[m]$ and $N_{lk}^{XY}[m]$. The
remaining calculations to form $r_{lk}^{XY}[m]$ and subsequently $\Gamma_{lk}[m]$ can be
performed in software. Note that the four functions $r_{lk}^{XY}[m]$ corresponding to X, Y =
R, I, which are components of $\Gamma_{lk}[m]$, can be calculated in parallel. For K_v = 200
virtual users, and assuming that 10% of all (l, k) pairs must be calculated in 2 ms,
then for real-time operation we must calculate 0.10(200)^2 = 4000 $\Gamma_{lk}[m]$ elements
(all shifts) in 2 ms, or about 2M elements (all shifts) per second. For K_v = 128
virtual users the requirement drops to 0.8192M elements (all shifts) per second.

In what has been presented the $\Gamma_{lk}[m]$ elements are calculated for all 512 shifts.
Not all of these shifts are needed, so it is possible to reduce the number of
calculations per $\Gamma_{lk}[m]$ element. The cost is increased design complexity.
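
The real-time requirement above is a simple rate calculation, sketched here in C. The function name is an illustrative assumption.

```c
#include <assert.h>

/* Elements (all shifts) that must be computed per second when a given
 * fraction of all (l, k) pairs must be refreshed every interval_s seconds. */
double required_elements_per_s(int kv, double fraction, double interval_s)
{
    assert(kv >= 0 && fraction >= 0.0 && interval_s > 0.0);
    return fraction * (double)kv * (double)kv / interval_s;
}
```

`required_elements_per_s(200, 0.10, 0.002)` gives about 2M elements per second, and the same call with 128 users gives about 0.8192M.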
#
# Makefile 2/23/2001
#

.SUFFIXES: .a .c .mac .o .S

ARCH = PPC7400
MUDLIB = mudlib.a

###CFLAGS = -Ot -t ${ARCH} -I. -DCOMPILE_C
CFLAGS = -Ot -t ${ARCH} -I.
ASFLAGS = -t ${ARCH} -DBUILD_MAX -I.

#
# Make object files
#
.c.o:
	ccmc ${CFLAGS} -o $*.o -c $*.c

#
# Make ASM
#
.mac.o:
	rm -f $*.S
	cp $*.mac $*.S
	ccmc ${ASFLAGS} -o $*.o -c $*.S
	rm -f $*.S

OBJS = \
	get_sizes.o \
	get_sizes_v.o \
	reformat_corr.o \
	rmats.o \
	reformat_r.o \
	mpic.o \
	gen_x_row.o \
	gen_r_sums.o \
	gen_r_sums2.o \
	gen_r_matrices.o \
	mtrans32_8bit.o \
	mtriangle_8bit.o \
	dotpr3_8bit.o \
	dotpr6_8bit.o \
	dotpr9_8bit.o \
	sve3_8bit.o \
	fixed_cdotpr.o \
	zdotpr4_vmx.o \
	zdotpr_vmx.o

${MUDLIB}: Makefile ${OBJS}
	armc -c $@ ${OBJS}

#
# Cleanup
#
clean:
	rm -f ${OBJS} *.S ${MUDLIB}

get_sizes.o: mudlib.h get_sizes.c
reformat_corr.o: mudlib.h reformat_corr.c
rmats.o: mudlib.h rmats.c \
	gen_x_row.mac gen_r_sums.mac gen_r_sums2.mac \
	gen_r_matrices.mac
reformat_r.o: mudlib.h reformat_r.c
mpic.o: mudlib.h mpic.c \
	dotpr3_8bit.mac dotpr6_8bit.mac dotpr9_8bit.mac \
	sve3_8bit.mac
dotpr3_8bit.o: dotpr3_8bit.mac salppc.inc
dotpr6_8bit.o: dotpr6_8bit.mac salppc.inc
dotpr9_8bit.o: dotpr9_8bit.mac salppc.inc
sve3_8bit.o: sve3_8bit.mac salppc.inc
fixed_cdotpr.o: zdotpr4_vmx.mac salppc.inc
zdotpr4_vmx.o: zdotpr4_vmx.mac zdotpr4_vmx.k salppc.inc
#include "mudlib.h"

#define DO_CALC_STATS 0
#define DO_TRUNCATE 1
#define DO_SATURATE 1
#define DO_SQUELCH 0

#define SQUELCH_THRESH 1.0
#define TRUNCATE_BIAS 0.0

#if DO_TRUNCATE
#define SATURATE_THRESH (128.0 + TRUNCATE_BIAS)
#else
#define SATURATE_THRESH 127.5
#endif

#define SATURATE( f ) \
{ \
    if ( (f) >= SATURATE_THRESH ) f = (SATURATE_THRESH - 1.0); \
    else if ( (f) < -SATURATE_THRESH ) f = -SATURATE_THRESH; \
}

#if DO_TRUNCATE
#if 0
#define BF8_FIX( f ) ((BF8)(FABS(f) <= TRUNCATE_BIAS) ? 0 : \
                      (((f) > 0.0) ? ((f) - TRUNCATE_BIAS) : \
                       ((f) + TRUNCATE_BIAS)))
#define BF8_FIX( f ) ((BF8)(f))
#else
#define BF8_FIX( f ) \
    ((BF8)((((f) < 0.0) && ((f) == (float)((int)(f)))) ? \
           ((f) + 1.0) : (f)))
#endif
#else
#define BF8_FIX( f ) ((BF8)(((f) >= 0.0) ? ((f)+0.5) : ((f)-0.5)))
#endif

#define UPDATE_MAX( f, max ) \
    if ( FABS( f ) > max ) max = FABS( f );

#define uchar unsigned char
#define ushort unsigned short
#define ulong unsigned long

#if DO_CALC_STATS
static float max_R_value;
#endif

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
);

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
);

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
);

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
);

void mudlib_gen_R (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,     /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,     /* adjusted for starting physical user */
    uchar *ptov_map,            /* no more than 256 virts. per phys */
    float *bf_scalep,           /* scalar: always a power of 2 */
    float *inv_scalep,          /* start at 0'th physical user */
    float *scalep,              /* start at 0'th physical user */
    char *L1_cachep,            /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,        /* zero-based starting row (inclusive) */
    int start_virt_user,        /* relative to start_phys_user */
    int end_phys_user,          /* zero-based ending row (inclusive) */
    int end_virt_user           /* relative to end_phys_user */
)
{
    COMPLEX_BF16 *X_bf;
    BF32 *R_sums0, *R_sums1;
    uchar *R0_ptov_map;
    int bump, byte_offset, i, iv, last_virt_user;
    int R0_align, R0_skipped_virt_users, R0_tcols, R0_virt_users, R1_tcols;

#if DO_CALC_STATS
    max_R_value = 0.0;
#endif

    /* carve the temporary buffers out of the L1 cache scratch area */
    X_bf = (COMPLEX_BF16 *)L1_cachep;

    byte_offset = tot_phys_users * NUM_FINGERS_SQUARED * sizeof(COMPLEX_BF16);
    R_sums0 = (BF32 *)(((ulong)X_bf + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);

    byte_offset = tot_virt_users * sizeof(BF32);
    R_sums1 = (BF32 *)(((ulong)R_sums0 + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);
    R0_ptov_map = (uchar *)(((ulong)R_sums1 + byte_offset +
                             R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK);

    R1_tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_virt_users = 0;
    for ( i = start_phys_user; i < tot_phys_users; i++ ) {
        R0_virt_users += (int)ptov_map[i];
        R0_ptov_map[i] = ptov_map[i];
    }
    R0_ptov_map[start_phys_user] -= start_virt_user;

    R0_skipped_virt_users = tot_virt_users - R0_virt_users + start_virt_user;
    R0_virt_users -= (start_virt_user + 1);

    --inv_scalep;    /* predecrement to allow for common indexing */

    for ( i = start_phys_user; i <= end_phys_user; i++ ) {

        gen_X_row (
            mpath1_bf,
            mpath2_bf,
            X_bf,
            i,
            tot_phys_users
        );

        --R0_ptov_map[i];    /* excludes R0 diagonal */

        last_virt_user = (i < end_phys_user) ? ((int)ptov_map[i] - 1) :
                                               end_virt_user;

        /* process two virtual users (two output rows) per pass */
        for ( iv = start_virt_user; (iv + 1) <= last_virt_user; iv += 2 ) {

            gen_R_sums2 (
                X_bf + (i * NUM_FINGERS_SQUARED),
                corr_0_bf,
                corr_0_bf + ((R0_virt_users - 1) * NUM_FINGERS_SQUARED),
                R0_ptov_map + i,
                R_sums0 + (R0_skipped_virt_users + 1),
                R_sums1 + (R0_skipped_virt_users + 1),
                tot_phys_users - i
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            R0_tcols = R1_tcols - ((R0_skipped_virt_users + 1) &
                                   ~R_MATRIX_ALIGN_MASK);
            R0_align = ((R0_skipped_virt_users + 1) & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums1 + (R0_skipped_virt_users + 2),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep + (R0_skipped_virt_users + 2),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users - 1
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums2 (
                X_bf,
                corr_1_bf,
                corr_1_bf + (tot_virt_users * NUM_FINGERS_SQUARED),
                ptov_map,
                R_sums0,
                R_sums1,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            gen_R_matrices (
                R_sums1,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (((2 * R0_virt_users) - 1) * NUM_FINGERS_SQUARED);
            corr_1_bf += ((2 * tot_virt_users) * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 2;
            R0_virt_users -= 2;
            R0_skipped_virt_users += 2;
        }

        /* handle a remaining single virtual user, if any */
        if ( iv <= last_virt_user ) {

            bump = R0_ptov_map[i] ? 0 : 1;
            gen_R_sums (
                X_bf + ((i + bump) * NUM_FINGERS_SQUARED),
                corr_0_bf,
                R0_ptov_map + i + bump,
                R_sums0 + (R0_skipped_virt_users + 1),
                tot_phys_users - i - bump
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums (
                X_bf,
                corr_1_bf,
                ptov_map,
                R_sums0,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (R0_virt_users * NUM_FINGERS_SQUARED);
            corr_1_bf += (tot_virt_users * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 1;
            R0_virt_users -= 1;
            R0_skipped_virt_users += 1;
        }

        start_virt_user = 0;    /* for all subsequent passes */
    }

#if DO_CALC_STATS
    printf( "max R value = %f\n", max_R_value );
    if ( max_R_value > 127.0 )
        printf( "***** OVERFLOW *****\n" );
#endif
}

#if COMPILE_C

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
)
{
    COMPLEX_BF16 *in_mpath1p, *in_mpath2p;
    COMPLEX_BF16 *out_mpath1p, *out_mpath2p;
    int i, j, q, q1;
    BF32 s1r, s1i, s2r, s2i;
    BF32 a1r, a1i, a2r, a2i;
    BF32 cr, ci;

    out_mpath1p = mpath1_bf + (phys_index * NUM_FINGERS);
    out_mpath2p = mpath2_bf + (phys_index * NUM_FINGERS);

    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * NUM_FINGERS);    /* 4 complex values */
        in_mpath2p = mpath2_bf + (i * NUM_FINGERS);    /* 4 complex values */
        j = 0;
        for ( q1 = 0; q1 < NUM_FINGERS; q1++ ) {
            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;
            for ( q = 0; q < NUM_FINGERS; q++ ) {
                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;

                cr = (a1r * s1r) + (a1i * s1i);
                ci = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * NUM_FINGERS_SQUARED + j].real = (BF16)(cr >> 16);
                X_bf[i * NUM_FINGERS_SQUARED + j].imag = (BF16)(ci >> 16);
                ++j;
            }
        }
    }
}

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
)
{
    int i, j, k;
    BF32 sum;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)corr_bf->imag;
                ++corr_bf;
            }
            *R_sums++ = sum;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
)
{
    int i, j, k;
    BF32 suma, sumb;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            suma = 0;
            sumb = 0;
            for ( k = 0; k < 16; k++ ) {
                suma += (BF32)X_bf[k].real * (BF32)corra_bf->real;
                suma += (BF32)X_bf[k].imag * (BF32)corra_bf->imag;
                sumb += (BF32)X_bf[k].real * (BF32)corrb_bf->real;
                sumb += (BF32)X_bf[k].imag * (BF32)corrb_bf->imag;
                ++corra_bf;
                ++corrb_bf;
            }
            *R_sumsa++ = suma;
            *R_sumsb++ = sumb;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
)
{
    int i;
    float bf_scale, fsum, fsum_scale, inv_scale, scale;

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float)(R_sums[i]);
        fsum *= bf_scale;

        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

#if DO_CALC_STATS
        UPDATE_MAX( fsum_scale, max_R_value )
        UPDATE_MAX( fsum, max_R_value )
#endif

#if DO_SQUELCH
        if ( FABS( fsum_scale ) <= SQUELCH_THRESH ) fsum_scale = 0.0;
        if ( FABS( fsum ) <= SQUELCH_THRESH ) fsum = 0.0;
#endif

#if DO_SATURATE
        SATURATE( fsum_scale )
        SATURATE( fsum )
#endif

        no_scale_row_bf[i] = BF8_FIX( fsum );
        scale_row_bf[i] = BF8_FIX( fsum_scale );
    }
}

#endif /* COMPILE_C */
/**
    MC Standard Algorithms    PPC Macro language Version

    File Name: dotpr3_8bit.mac
**/

#include "salppc.inc"

#define LVX_BT( vT, rA, rB )        LVX( vT, rA, rB )

#define FUNC_ENTRY                  dotpr3_8bit
#define VMSUM( vT, vA, vB, vC )     VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT            6
#define HALF_BLOCK_BIT              0x20
#define QUARTER_BLOCK_BIT           0x10
#define LOOP_BLOCK_SIZE             64

/**
    Input parameters
**/
#define bt1mptr     r3
#define r1ptr       r4
#define r0ptr       r5
#define r1mptr      r6
#define C           r7
#define N           r8
#define hat_tc      r9

/**
    Local loop registers
**/
#define bt0ptr      r10
#define bt1ptr      r11
#define index1      r12
#define index2      r13
#define index3      r0
#define icount      hat_tc

/**
    G4 registers
**/
#define rq10        v0
#define rq11        v1
#define rq12        v2
#define rq13        v3
#define zero        v3

#define rq00        v4
#define rq01        v5
#define rq02        v6
#define rq03        v7

#define rq1m0       v8
#define rq1m1       v9
#define rq1m2       v10
#define rq1m3       v11

#define bt1m0       v12
#define bt1m1       v13
#define bt1m2       v14
#define bt1m3       v15

#define bt10        v16
#define bt11        v17
#define bt12        v18
#define bt13        v19

#define bt00        v20
#define bt01        v21
#define bt02        v22
#define bt03        v23

#define sum0        v24
#define sum1        v25
#define sum2        v26
#define sum3        v27

/**
    Begin code text
    Setup loop registers, test for zero N
**/
    FUNC_PROLOG
    ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
    SAVE_r13
    USE_THRU_v27( VRSAVE_COND )

/**
    Load up local loop registers
**/
    ADD(bt0ptr, bt1mptr, hat_tc)
    VXOR(sum0, sum0, sum0)
    ADD(bt1ptr, bt0ptr, hat_tc)
    LI(index1, 16)
    VXOR(sum1, sum1, sum1)
    LI(index2, 32)
    VXOR(sum2, sum2, sum2)
    LI(index3, 48)
    VXOR(sum3, sum3, sum3)
    SRWI_C(icount, N, LOOP_COUNT_SHIFT)    /* 32 sum updates per loop trip */
    BEQ(do_half_block)

/**
    Loop entry code
**/
    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    LVX( rq12, r1ptr, index2 )
    LVX( rq13, r1ptr, index3 )
    DECR_C( icount )
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt1m2, bt1mptr, index2 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
    BR( mid_loop )

/**
    Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
    LVX( rq10, 0, r1ptr )
    VMSUM( sum0, rq1m0, bt10, sum0 )
    LVX( rq11, r1ptr, index1 )
    VMSUM( sum1, rq1m1, bt11, sum1 )
    LVX( rq12, r1ptr, index2 )
    VMSUM( sum2, rq1m2, bt12, sum2 )
    LVX( rq13, r1ptr, index3 )
    DECR_C(icount)
    LVX_BT( bt1m0, 0, bt1mptr )
    VMSUM( sum3, rq1m3, bt13, sum3 )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt1m2, bt1mptr, index2 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)

LABEL( mid_loop )
    LVX( rq00, 0, r0ptr )
    VMSUM( sum0, rq10, bt1m0, sum0 )
    LVX( rq01, r0ptr, index1 )
    VMSUM( sum1, rq11, bt1m1, sum1 )
    LVX( rq02, r0ptr, index2 )
    VMSUM( sum2, rq12, bt1m2, sum2 )
    LVX( rq03, r0ptr, index3 )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum3, rq13, bt1m3, sum3 )
    LVX_BT( bt01, bt0ptr, index1 )
    ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt02, bt0ptr, index2 )
    LVX_BT( bt03, bt0ptr, index3 )
    ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)

    LVX( rq1m0, 0, r1mptr )
    VMSUM( sum0, rq00, bt00, sum0 )
    LVX( rq1m1, r1mptr, index1 )
    VMSUM( sum1, rq01, bt01, sum1 )
    LVX( rq1m2, r1mptr, index2 )
    VMSUM( sum2, rq02, bt02, sum2 )
    LVX( rq1m3, r1mptr, index3 )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum3, rq03, bt03, sum3 )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt12, bt1ptr, index2 )
    LVX_BT( bt13, bt1ptr, index3 )
    ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
/* } */
    BNE( loop )

/**
    Loop exit code
**/
    VMSUM( sum0, rq1m0, bt10, sum0 )
    VMSUM( sum1, rq1m1, bt11, sum1 )
    VMSUM( sum2, rq1m2, bt12, sum2 )
    VMSUM( sum3, rq1m3, bt13, sum3 )

/**
    Remainders
**/
LABEL(do_half_block)
    ANDI_C( icount, N, HALF_BLOCK_BIT )
    BEQ(do_quarter_block)

    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq10, bt1m0, sum0 )
    VMSUM( sum1, rq11, bt1m1, sum1 )

    LVX( rq00, 0, r0ptr )
    LVX( rq01, r0ptr, index1 )
    LVX_BT( bt00, 0, bt0ptr )
    LVX_BT( bt01, bt0ptr, index1 )
    ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq00, bt00, sum0 )
    VMSUM( sum1, rq01, bt01, sum1 )

    LVX( rq1m0, 0, r1mptr )
    LVX( rq1m1, r1mptr, index1 )
    LVX_BT( bt10, 0, bt1ptr )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq1m0, bt10, sum0 )
    VMSUM( sum1, rq1m1, bt11, sum1 )

LABEL(do_quarter_block)
    ANDI_C( icount, N, QUARTER_BLOCK_BIT )
    BEQ(combine)

    LVX( rq10, 0, r1ptr )
    LVX_BT( bt1m0, 0, bt1mptr )
    VMSUM( sum0, rq10, bt1m0, sum0 )

    LVX( rq00, 0, r0ptr )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum0, rq00, bt00, sum0 )

    LVX( rq1m0, 0, r1mptr )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum0, rq1m0, bt10, sum0 )

/**
    Combine sums and return
**/
LABEL(combine)
    VXOR( zero, zero, zero )
    VADDSWS( sum0, sum0, sum1 )    /* s00 s01 s02 s03 */
    VADDSWS( sum2, sum2, sum3 )    /* s20 s21 s22 s23 */
    VADDSWS( sum0, sum0, sum2 )    /* s00 s01 s02 s03 */
    VSUMSWS( sum0, sum0, zero )    /* xxx xxx xxx s00 */
    VSPLTW( sum0, sum0, 3 )        /* s00 s00 s00 s00 */
    STVEWX( sum0, 0, C )

/**
    Return
**/
LABEL( ret )
    FREE_THRU_v27( VRSAVE_COND )
    REST_r13
    RETURN
    FUNC_EPILOG
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# include " salppc . inc " 

#define LVX_BT( vT, rA, rB ) LVX( vT, rA, rB )
#define FUNC_ENTRY dotprS_8bit
#define VMSUM( vT, vA, vB, vC ) VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT 6
#define HALF_BLOCK_BIT 0x20
#define QUARTER_BLOCK_BIT 0x10

#define LOOP_BLOCK_SIZE 64

/**
Input parameters
**/

#define bt1mptr r3
#define r1ptr r4
#define r0ptr r5
#define r1mptr r6
#define C r7
#define N r8
#define hat_tc r9

/**
Local loop registers
**/

#define bt0ptr r10
#define bt1ptr r11
#define bt2ptr r12
#define index1 r13
#define index2 r14
#define index3 r0
#define icount hat_tc
/**
G4 registers
**/

#define rq10 v0
#define rq11 v1
#define rq12 v2
#define rq13 v3
#define zero v3

#define rq00 v4
#define rq01 v5
#define rq02 v6
#define rq03 v7

#define rq1m0 v8
#define rq1m1 v9
#define rq1m2 v10
#define rq1m3 v11

#define bt1m0 v12
#define bt1m1 v13
#define bt1m2 v14
#define bt1m3 v15

#define bt10 v12
#define bt11 v13
#define bt12 v14
#define bt13 v15

#define bt00 v16
#define bt01 v17
#define bt02 v18
#define bt03 v19

#define bt20 v16
#define bt21 v17
#define bt22 v18
#define bt23 v19

#define sum00 v20
#define sum01 v21
#define sum02 v22
#define sum03 v23

#define sum10 v24
#define sum11 v25
#define sum12 v26
#define sum13 v27

/**
Begin code text
**/

FUNC_PROLOG

ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r14

USE_THRU_v27( VRSAVE_COND )
/**
Load up local loop registers
**/

ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum00, sum00, sum00 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
ADD( bt2ptr, bt1ptr, hat_tc )

VXOR( sum01, sum01, sum01 )
LI( index2, 32 )
VXOR( sum02, sum02, sum02 )
LI( index3, 48 )
VXOR( sum03, sum03, sum03 )

VXOR( sum10, sum10, sum10 )
VXOR( sum11, sum11, sum11 )
VXOR( sum12, sum12, sum12 )
VXOR( sum13, sum13, sum13 )
SRWI_C( icount, N, LOOP_COUNT_SHIFT )
BEQ( do_half_block )

/**
Loop entry code
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**/ 

LVX_BT( bt1m0, 0, bt1mptr )
DECR_C( icount )

LVX_BT( bt1m1, bt1mptr, index1 )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )

LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
BR( mid_loop )

/**
Loop computes three dot products held in 16 parts
**/

LABEL( loop )

/* { */

LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum10, rq1m0, bt20, sum10 )
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum11, rq1m1, bt21, sum11 )
LVX_BT( bt1m2, bt1mptr, index2 )
DECR_C( icount )
VMSUM( sum12, rq1m2, bt22, sum12 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum13, rq1m3, bt23, sum13 )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
LVX( rq13, r1ptr, index3 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )

LABEL( mid_loop )

LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt01, bt0ptr, index1 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX_BT( bt03, bt0ptr, index3 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
LVX( rq01, r0ptr, index1 )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum11, rq11, bt01, sum11 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum03, rq03, bt03, sum03 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum11, rq01, bt11, sum11 )
LVX( rq1m2, r1mptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX( rq1m3, r1mptr, index3 )
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )

/* } */

BNE( loop )
/**
Loop exit code
**/

VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )

/**
Remainders
**/

LABEL( do_half_block )

ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )

LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq10, bt1m0, sum00 )
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum10, rq10, bt00, sum10 )
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI( r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq00, bt00, sum00 )
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum10, rq00, bt10, sum10 )
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI( r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq1m0, bt10, sum00 )
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI( bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )

LABEL( do_quarter_block )

ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )

LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum10, rq1m0, bt20, sum10 )

/**
Combine sums and return
**/

LABEL( combine )

VXOR( zero, zero, zero )

VADDSWS( sum00, sum00, sum01 ) /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum02, sum02, sum03 ) /* s22 s21 s22 s23 */
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum00, sum00, sum02 ) /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum12 )

VSUMSWS( sum00, sum00, zero ) /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )

VSPLTW( sum00, sum00, 3 ) /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )

/**
Return
**/

LABEL( ret )

FREE_THRU_v27( VRSAVE_COND )
REST_r13_r14
RETURN
FUNC_EPILOG
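
The routine above splits the count N into full 64-byte loop passes plus an optional half block and quarter block, selected by testing individual bits of N. The following C fragment is a minimal sketch of that decomposition under the constants defined in the listing. It assumes N counts 8-bit elements (so one element is one byte); the type `block_plan` and the function `plan_blocks` are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define LOOP_COUNT_SHIFT   6      /* 64-byte main-loop block */
#define HALF_BLOCK_BIT     0x20   /* 32-byte remainder       */
#define QUARTER_BLOCK_BIT  0x10   /* 16-byte remainder       */

/* Hypothetical summary of how a count n is consumed. */
typedef struct {
    uint32_t full_blocks;    /* passes through the main loop        */
    uint32_t half_block;     /* 1 if the 32-byte remainder is taken */
    uint32_t quarter_block;  /* 1 if the 16-byte remainder is taken */
    uint32_t bytes_covered;  /* total bytes actually processed      */
} block_plan;

static block_plan plan_blocks(uint32_t n)
{
    block_plan p;
    p.full_blocks   = n >> LOOP_COUNT_SHIFT;
    p.half_block    = (n & HALF_BLOCK_BIT) ? 1u : 0u;
    p.quarter_block = (n & QUARTER_BLOCK_BIT) ? 1u : 0u;
    p.bytes_covered = p.full_blocks * 64u
                    + p.half_block * 32u
                    + p.quarter_block * 16u;
    return p;
}
```

In this sketch the low four bits of the count never contribute, so a caller would presumably supply counts that are multiples of 16, matching the 16-byte vector loads in the listing.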
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#include "salppc.inc" 

#define LVX_BT( vT, rA, rB ) LVX( vT, rA, rB )
#define FUNC_ENTRY dotpr9_8bit
#define VMSUM( vT, vA, vB, vC ) VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT 6
#define HALF_BLOCK_BIT 0x20
#define QUARTER_BLOCK_BIT 0x10

#define LOOP_BLOCK_SIZE 64

/**
Input parameters
**/

#define bt1mptr r3
#define r1ptr r4
#define r0ptr r5
#define r1mptr r6
#define C r7
#define N r8
#define hat_tc r9

/**
Local loop registers
**/

#define bt0ptr r10
#define bt1ptr r11
#define bt2ptr r12
#define bt3ptr r13
#define index1 r14
#define index2 r15
#define index3 r0
#define icount hat_tc

/**
G4 registers
**/

#define rq10 v0
#define rq11 v1
#define rq12 v2
#define rq13 v3
#define zero v3

#define bt30 v0
#define bt31 v1
#define bt32 v2
#define bt33 v3

#define rq00 v4
#define rq01 v5
#define rq02 v6
#define rq03 v7

#define rq1m0 v8
#define rq1m1 v9
#define rq1m2 v10
#define rq1m3 v11

#define bt1m0 v12
#define bt1m1 v13
#define bt1m2 v14
#define bt1m3 v15

#define bt10 v12
#define bt11 v13
#define bt12 v14
#define bt13 v15

#define bt00 v16
#define bt01 v17
#define bt02 v18
#define bt03 v19

#define bt20 v16
#define bt21 v17
#define bt22 v18
#define bt23 v19

#define sum00 v20
#define sum01 v21
#define sum02 v22
#define sum03 v23

#define sum10 v24
#define sum11 v25
#define sum12 v26
#define sum13 v27

#define sum20 v28
#define sum21 v29
#define sum22 v30
#define sum23 v31

/**
Begin code text
**/

FUNC_PROLOG

ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r15

USE_THRU_v31( VRSAVE_COND )
/**
Load up local loop registers
**/

ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum00, sum00, sum00 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
ADD( bt2ptr, bt1ptr, hat_tc )
VXOR( sum01, sum01, sum01 )
ADD( bt3ptr, bt2ptr, hat_tc )
LI( index2, 32 )
VXOR( sum02, sum02, sum02 )
LI( index3, 48 )
VXOR( sum03, sum03, sum03 )

VXOR( sum10, sum10, sum10 )
VXOR( sum11, sum11, sum11 )
VXOR( sum12, sum12, sum12 )
VXOR( sum13, sum13, sum13 )

VXOR( sum20, sum20, sum20 )
VXOR( sum21, sum21, sum21 )
VXOR( sum22, sum22, sum22 )
VXOR( sum23, sum23, sum23 )

SRWI_C( icount, N, LOOP_COUNT_SHIFT )
BEQ( do_half_block )

/**
Loop entry code
**/

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
DECR_C( icount )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )

LVX( rq10, 0, r1ptr )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
BR( mid_loop )

/**
Nine dot products producing 3 sums:
sum0 = (R1 * Bt1m) + (R0 * Bt0) + (R1m * Bt1)
sum1 = (R1 * Bt0) + (R0 * Bt1) + (R1m * Bt2)
sum2 = (R1 * Bt1) + (R0 * Bt2) + (R1m * Bt3)
**/

LABEL( loop )

/* { */

LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum21, rq1m1, bt31, sum21 )
LVX_BT( bt1m2, bt1mptr, index2 )
VMSUM( sum22, rq1m2, bt32, sum22 )
LVX_BT( bt1m3, bt1mptr, index3 )

LVX( rq10, 0, r1ptr )
VMSUM( sum23, rq1m3, bt33, sum23 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )

VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
LVX( rq12, r1ptr, index2 )
VMSUM( sum21, rq01, bt21, sum21 )
DECR_C( icount )

VMSUM( sum22, rq02, bt22, sum22 )
LVX( rq13, r1ptr, index3 )

VMSUM( sum23, rq03, bt23, sum23 )
LVX_BT( bt00, 0, bt0ptr )

/**
Loop entry
**/

LABEL( mid_loop )

VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
LVX_BT( bt01, bt0ptr, index1 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt03, bt0ptr, index3 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )

VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */
LVX( rq01, r0ptr, index1 )
VMSUM( sum11, rq11, bt01, sum11 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )

VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
VMSUM( sum03, rq03, bt03, sum03 )
LVX( rq1m0, 0, r1mptr )

VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum21, rq11, bt11, sum21 )
LVX( rq1m2, r1mptr, index2 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
LVX( rq1m3, r1mptr, index3 )
VMSUM( sum22, rq12, bt12, sum22 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum23, rq13, bt13, sum23 )
LVX_BT( bt21, bt2ptr, index1 )

VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
VMSUM( sum11, rq01, bt11, sum11 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum13, rq03, bt13, sum13 )

LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )

VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */
LVX_BT( bt32, bt3ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt33, bt3ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )

VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI( bt3ptr, bt3ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )

/* } */

BNE( loop )

/**
Loop exit code
**/

VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )
VMSUM( sum22, rq1m2, bt32, sum22 )
VMSUM( sum23, rq1m3, bt33, sum23 )

VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum22, rq02, bt22, sum22 )
VMSUM( sum23, rq03, bt23, sum23 )



/**
Remainders
**/

LABEL( do_half_block )

ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )

LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI( r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
VMSUM( sum21, rq11, bt11, sum21 )
VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI( r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI( bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )

LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
ADDI( bt3ptr, bt3ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )

/**
four more sums
**/

LABEL( do_quarter_block )

ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )

LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )

VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
ADDI( bt1mptr, bt1mptr, 16 )

LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */

LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */

LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */

LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */

LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */

LVX_BT( bt30, 0, bt3ptr )
VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */

/**
Combine sums and return
**/

LABEL( combine )

VXOR( zero, zero, zero )

VADDSWS( sum00, sum00, sum01 )
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum20, sum20, sum21 )

VADDSWS( sum02, sum02, sum03 )
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum22, sum22, sum23 )

VADDSWS( sum00, sum00, sum02 )
VADDSWS( sum10, sum10, sum12 )
VADDSWS( sum20, sum20, sum22 )

VSUMSWS( sum00, sum00, zero ) /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )
VSUMSWS( sum20, sum20, zero )

VSPLTW( sum00, sum00, 3 ) /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum20, sum20, 3 )
STVEWX( sum20, 0, C )

/**
Return
**/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
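
For checking the vector routine above, a plain scalar reference can be helpful. The sketch below computes the three sums named in the listing's comment (nine dot products producing three sums), deriving the five Bt pointers from the first one by repeatedly adding hat_tc, as the register set-up does. It rests on assumptions not stated in the listing: that R holds signed 8-bit values and Bt unsigned 8-bit values (suggested by the mixed-sign VMSUMMBM mapping), that n is an element count, and that 64-bit accumulation is acceptable for a reference. The function name `dotpr9_ref` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the nine-dot-product routine:
     c[0] = R1*Bt1m + R0*Bt0 + R1m*Bt1
     c[1] = R1*Bt0  + R0*Bt1 + R1m*Bt2
     c[2] = R1*Bt1  + R0*Bt2 + R1m*Bt3
   where consecutive Bt vectors are hat_tc bytes apart. */
static void dotpr9_ref(const uint8_t *bt1m, const int8_t *r1,
                       const int8_t *r0, const int8_t *r1m,
                       int64_t c[3], size_t n, size_t hat_tc)
{
    for (int s = 0; s < 3; s++) {
        /* Sliding window of three adjacent Bt vectors. */
        const uint8_t *bt_a = bt1m + (size_t)s * hat_tc;
        const uint8_t *bt_b = bt_a + hat_tc;
        const uint8_t *bt_c = bt_b + hat_tc;
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            sum += (int64_t)r1[i]  * bt_a[i];
            sum += (int64_t)r0[i]  * bt_b[i];
            sum += (int64_t)r1m[i] * bt_c[i];
        }
        c[s] = sum;
    }
}
```

Unlike the listing, this sketch does not saturate and does not ignore a trailing partial vector, so results would be expected to agree only when n is a multiple of 16 and no 32-bit overflow occurs.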






Page No. 210 



EV 093 931 797 US 
Page No. 237 

fixed_cdotpr.mac



2/23/2001 






#ifndef MCOS_55
#define MCOS_55 0
#endif

/*
-- MC Standard Algorithms -- 603e Macro language Version

File Name: CDOTPR.MAC

Description: Vector Single Precision Complex Dot Product
Entry/params: CDOTPR (A, I, B, J, C, N)
Formula: C[0] = sum (A[mI]*B[mJ] - A[mI+1]*B[mJ+1])
         C[1] = sum (A[mI]*B[mJ+1] + A[mI+1]*B[mJ])
         for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1995 All rights reserved

Revision  Date    Engineer  Reason
0.0       960502  fpl       Created
0.1       960618  fpl       Added Esal entry
0.2       970128  fpl       Added dcbt logic
0.3       970203  fpl       Corrected ABIT define
0.4       970522  jfk       Added new dcbx test macros
0.5       980325  fpl       Added 740 code segment
0.6       980404  fpl       Removed loop stall
0.7       980708  fpl       Added build macros
0.8       980820  jfk       Added new DCBT macro
0.9       981019  fpl       Added z function
0.10      981025  fpl       Modified z entry
0.11      990310  fpl       750/G4 integration
0.12      990730  fpl       Added conjugate entry
1.0       000223  fpl       Increased minimum VMX count
1.1       000305  jfk       removed branches to entrypoints
1.2       000607  jfk       Fixed floating point save bug
1.3       000610  fpl       Added new API macro
*/



#include "salppc.inc"
#undef BR_IF_VMX_Z2

#define BR_IF_VMX_Z2( root_name, uroot_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
cmplwi n, min_n_imm; \
blt z_skip_vmx; \
cmpwi s1, unit_s_imm; \
bne z_skip_vmx; \
cmpwi s2, unit_s_imm; \
xor r0, pr1, pi1; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
xor r0, pr2, pi2; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
xor r0, pr1, pr2; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
bne z_unaligned_vmx; \
BR_VMX_Z2( root_name, eflag, s1 ) \
z_unaligned_vmx: \
BR_VMX_Z2( uroot_name, eflag, s1 ) \
z_skip_vmx:

#define ACOND 5
#define ABIT 2
#define BCOND 6
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#define BBIT 1 
/**
API registers
**/

#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
z input args
**/

#define Ar A
#define Ai r10
#define Br B
#define Bi r11
#define Cr C
#define Ci r12

/**
Local registers
**/

#define count r13
#define rtmp r13
#define nextline r14

/**
Fpu registers
**/

#define rsumr0 f0
#define rsumi0 f1
#define isumr0 f2
#define isumi0 f3

#define ar0 f4
#define ai0 f5
#define ar1 f6
#define ai1 f7
#define ar2 f8
#define ai2 f9
#define ar3 f10
#define ai3 f11

#define br0 f12
#define bi0 f13
#define br1 f14
#define bi1 f15
#define br2 f16
#define bi2 f17
#define br3 f18
#define bi3 f19

#if defined( BUILD_MAX )
#if MCOS_55
DECLARE_VMX_Z2( _zdotpr_vmx_cc )
#else
DECLARE_VMX_Z2( _zdotpr_vmx )
#endif
DECLARE_VMX_Z2( _zdotpr4_vmx )
#endif



/if* 

Code text: Conjugate 
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ry 



FUNC PROLOG 
#ifndef COMPILE C 
U_ENTRY{ fixed cidotpr ) 

FORTRAN DREF 3{ I, J, N ) 
U_ENTRY( fixed cidotpr ) 

LI ( EFLAG, SAL NNN ) 

BR( cidotprx common ) 
U_ENTRY{ fixed cidotprx ) 

FORTRAN DREF 4{ I, J, N, EFLAG ) 
U ENTRY ( fixed cidotprx ) 
LABEL ( cidotprx common ) 

ADDK Ai, Ar, 4 ) 

MR( Bi, Br ) 

ADDI ( Br, Br, 4 ) 

MR( Ci, Cr ) 

ADDI{ Cr, Cr, 4 ) 

BR ( common ) 

/* * 

Normal 
*★/ 

FUNC PROLOG 
#ifndef COMPILE C 
U_ENTRY( fixed cdotpr_ ) 

FORTRAN DREF 3( I, J, N ) 
U_ENTRY{ fixed cdotpr ) 
LI( EFLAG, SAL NNN ) 
BR( cdotprx common ) 
U_ENTRY{ fixed cdotprx ) 

FORTRAN DREF 4( I, J, N, EFLAG ) 
U ENTRY ( fixed cdotprx ) /* C 

LABEL { cdotprx common ) /* 
4 ) 
4 ) 
4 ) 



/* Fortran SAL */ 

/* C SAL */ 
/* NNN EFLAG (default) */ 
/ * common path * / 

/* Fortran ESAL */ 

/* C ESAL */ 

/* common path */ 



/* common path */ 



/* 



/* Fortran SAL */ 

/* C SAL */ 
NNN EFLAG (default) * 
common path */ 

/* Fortran ESAL */ 



/ 



ESAL */ 
common path 



/* common path */ 
Conjugate 



) 



N ) 



) 

N, 
A, 



/* Fortran SAL */ 

/* C SAL */ 
/* NNN EFLAG (default) */ 

/* Fortran ESAL */ 

EFLAG ) 

I, B, J, C, N, EFLAG) 



ADDI ( Ai, Ar, 
ADDI ( Bi, Br, 
ADDK Ci, Cr, 
BR ( common ) 

/** 

Split complex entries 

U_ENTRY( fixed zidotpr 

FORTRAN DREF 3( I, J, 
U_ENTRY( fixed zidotpr ) 
LI ( EFLAG, SAL NNN ) 
BR( zidotprx common ) 
U_ENTRY( fixed zidotprx 
F0RTRAN_DREF_4 ( I, J, 
#endif 

ENTRY 7( fixed zidotprx, 
LABEL { zidotprx_common ) 
/ ** 

Assign split complex pointers, do the conjugate trick 
* *y 

LWZ ( 
LWZ( 
LWZ( 
LWZ( 
LWZ( 
LWZ( 

BR ( z_common 

/** 
Normal 

U_ENTRY( fixed zdotpr_ ) 

FORTRAN DREF 3{ I, J, 
U_ENTRY( fixed zdotpr ) 

LI( EFLAG, SAL l^INN ) 

BR{ zdotprx_common ) 



Ai, 
Ar, 
Bi, 
Br, 
Ci, 
Cr, 



A, 
A, 
B, 
B, 
C. 
C, 



4 ) 

0 ) 

0 ) 

4 ) 

0 ) 

4 ) 
) 



/* Fortran SAL */ 

/* C SAL */ 
/* NNN EFLAG (default) * 




U_ENTRY( fixed zdotprx ) /* Fortran ESAL */ 

F0RTRAN_DREF_4 ( I, J, N, EFIiAG ) 

#endif 
/**
C ESAL
**/

ENTRY_7( fixed_zdotprx, A, I, B, J, C, N, EFLAG )
DECLARE_r10_r14
DECLARE_f0_f19

LABEL( zdotprx_common )

/**
Assign split complex pointers
**/

LWZ( Ai, A, 4 ) /* must load imag first since Ar reg = A reg */
LWZ( Ar, A, 0 )
LWZ( Bi, B, 4 )
LWZ( Br, B, 0 )
LWZ( Ci, C, 4 )
LWZ( Cr, C, 0 )

/**
VMX API filter
Test if okay to enter VMX code and branch to VMX code
VMX loop - process all N points
**/

LABEL( z_common )
#if defined( BUILD_MAX )

#define MIN_VMX_N 20
#define UNIT_STRIDE 1

#if MCOS_55
BR_IF_VMX_Z2( zdotpr_vmx_cc, zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
Ar, Ai, I, Br, Bi, J, N, EFLAG )
#else
BR_IF_VMX_Z2( zdotpr_vmx, zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
Ar, Ai, I, Br, Bi, J, N, EFLAG )
#endif

#endif /* BUILD_MAX */

/**
Point of common path where all entries join
Test for small counts
**/

LABEL( common )
SAVE_r13_r14
SAVE_f14_f19
CMPLWI( N, 0 )
BEQ( ret )
CMPLWI( N, 1 )
BEQ( do1 )
CMPLWI( N, 2 )
BEQ( do2 )
CMPLWI( N, 3 )
BEQ( do3 )

/**
check for uncached (and local) vectors
**/

SET_2_DCBT_COND( ACOND, ABIT, BCOND, BBIT, EFLAG, rtmp )

LI( nextline, 32 )

/**
740 code segment, start up loop code
**/
#if defined( BUILD_750 ) || defined( BUILD_MAX )

LFS( ar0, Ar, 0 )
SRWI( count, N, 2 ) /* count = N >> 2 */
LFS( br0, Br, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
SLWI( J, J, 2 )
LFS( bi0, Bi, 0 )
LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
LFSUX( ai1, Ai, I )
LFSUX( bi1, Bi, J )
LFSUX( ar2, Ar, I )
LFSUX( br2, Br, J )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )

FMULS( rsumr0, ar0, br0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMULS( rsumi0, ai0, bi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMULS( isumi0, ar0, bi0 )
DECR_C( count )
FMULS( isumr0, ai0, br0 )

BEQ( flush_loop_740 )
BR( mloop_740 )

/**
Top of 740 loop
**/

LABEL( loop_740 )

LFSUX( ar3, Ar, I )
FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
LFSUX( bi3, Bi, J )

LABEL( mloop_740 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
LFSUX( ar0, Ar, I )
DCBT_IF( ACOND, Ar, nextline )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
LFSUX( br0, Br, J )
DECR_C( count )
FMADDS( isumi0, ar1, bi1, isumi0 )
LFSUX( ai0, Ai, I )
FMADDS( isumr0, ai1, br1, isumr0 )
LFSUX( bi0, Bi, J )
DCBT_IF( BCOND, Br, nextline )
FMADDS( rsumr0, ar2, br2, rsumr0 )
LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
LFSUX( ai1, Ai, I )
FMADDS( isumi0, ar2, bi2, isumi0 )
LFSUX( bi1, Bi, J )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
LFSUX( ar2, Ar, I )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
LFSUX( br2, Br, J )
FMADDS( isumi0, ar3, bi3, isumi0 )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )
FMADDS( isumr0, ai3, br3, isumr0 )
BNE( loop_740 )

/**
Finish last pass
**/

FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

LABEL( flush_loop_740 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BR( remain )

#endif /** 750 specific code section **/ 

/**
set up for loop entry, here if N >= 2
**/

#if defined( BUILD_603 )
LABEL( start_603 )

LFS( ar0, Ar, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
SRWI( count, N, 2 ) /* count = N >> 2 */
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )

DCBT_IF( ACOND, Ar, nextline )

LFS( br0, Br, 0 )
DECR_C( count )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )

DCBT_IF( BCOND, Br, nextline )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )

BEQ( remain )

/**
main loop maintains four partial sums
representing two complex sum updates per pass
**/

LABEL( loop )

LFSUX( ar0, Ar, I )
LFSUX( ai0, Ai, I )
LFSUX( ar1, Ar, I )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )

DCBT_IF( ACOND, Ar, nextline )
DECR_C( count )

LFSUX( br0, Br, J )
LFSUX( bi0, Bi, J )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )

DCBT_IF( BCOND, Br, nextline )

FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
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BNE( loop ) 
#endif /** 603 specific code section **/ 

remainder loop 
**/ 

LABEL (remain) 

ANDI_C( count, N, 2 ) /* bit 2 */ 
BEQ( suml ) 

LFSUX( arc, Ar, I ) 

LFSUX( aiO, Ai, I ) 

LFSUX( arl, Ar, I ) 

LFSUX( ail, Ai, I ) 



LFSUX( brO, Br, J 

LFSUX( biO, Bi, J 

LFSUX( brl, Br, J 

LFSXJX( bil, Bi, J 



FMADDS( rsumrO, arO, brO, rsumrO ) 

FMADDS( rsumiO, aiO, biO, rsumiO ) 

FMADDS( isuTniO, arO, biO, isumiO ) 

FMADDS ( isumrO, aiO, brO , isumrO ) 



FMADDS ( rsumrO , arl , brl , rsumrO ) 

FMADDS ( rsumiO, ail, bil, rsumiO ) 

FMADDS ( isumiO, arl, bil, isumiO ) 

FMADDS ( isumrO, ail, brl, isumrO ) 



LABEL (suml) 

ANDI_C( count, N, 
BEQ( combine ) 



1 ) 



LFSUX( arO, Ar, I ) 

LFSUX( brO, Br, J ) 

LFSUX( aiO, Ai, I ) 

LFSUX( bio, Bi, J ) 



/* bit 0 */ 

/* if no sums left */ 



m 



/** rsumrO 
/** * (S + 



FMADDS ( rsumrO, arO, brO, rsumrO ) 
FMADDS { rsumiO, aiO, biO, rsumiO ) 
FMADDS ( isumiO, arO, biO, isumiO ) 
FMADDS ( isumrO, aiO, brO, isumrO ) 

/** 

combine partial sums, write out results and return 

LABEL (combine) 

FSUBS{ rsumrO, rsumrO, rsumiO ) 
STFS( rsumrO, Cr, 0 ) 
FADDS( isumiO, isumiO, isumrO ) 
STFS( isumiO, Ci, 0 ) 
BR (ret) 

/** 

here for N = 1,2,3 
**/ 

LABEL (do3) 

LFS( arO, Ar, 0 ) 
SLWK I, I, 2 ) 
LFS( aiO, Ai, 0 ) 
LFSXJX( arl, Ar, I ) 
SLWK J, J, 2 ) 
LFSUX( ail, Ai, I ) 
LFSUX( ar2, Ar, I ) 
LFSUX( ai2, Ai, I ) 



= rsumrO - rsumiO 
0) = rsumrO **/ 



/* byte strides */ 



LFS( brO, Br, 0 ) 
DECR_C( count ) 
LFS( biO, Bi, 0 ) 
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LFSUX( brl, Br, J ) 

LFSUX( bil, Bi, J ) 

LFSUX( br2, Br, J ) 

LFSUX{ bi2, Bi, J ) 



FMULS( rsumrO, arO, brO ) 

FMULS ( rsutniO, aiO, biO ) 

FMULS { isutniO, arO , biO ) 

FMULS ( isumrO, aiO, brO ) 

FMADDS( rsumrO, arl, brl, rsumrO ) 

FMADDS( rsumiO, ail, bil, rsumiO ) 

FMADDS( isumiO, arl, bil, isumiO ) 

FMADDS( isumrO, ail, brl, isumrO ) 

FMADDS( rsumrO, ar2, br2, rsumrO ) 

FMADDS{ rsumiO, ai2, bi2, rsumiO ) 

FMADDS( isumiO, ar2, bi2, isumiO ) 

FMADDS( isumrO, ai2, br2, isumrO ) 
BR (combine) 



m 
w 

s 



"1^ 



LABEL( do2 )
LFS( ar0, Ar, 0 )
SLWI( I, I, 2 )                         /* byte strides */
LFS( ai0, Ai, 0 )
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFS( br0, Br, 0 )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
BR( combine )



LABEL( do1 )
LFS( ai0, Ai, 0 )
LFS( bi0, Bi, 0 )
LFS( br0, Br, 0 )
LFS( ar0, Ar, 0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumr0, ai0, br0 )
FMSUBS( rsumr0, ar0, br0, rsumi0 )
STFS( rsumr0, Cr, 0 )
FMADDS( isumi0, ar0, bi0, isumr0 )
STFS( isumi0, Ci, 0 )

/**
  return
**/

LABEL( ret )
REST_f14_f19
REST_r13_r14
RETURN
FUNC_EPILOG
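
For reference, the arithmetic of the routine that ends here can be sketched in C. This is a minimal sketch under stated assumptions, not the routine's source: the name `cdot_split` is hypothetical, the strides are taken in elements (the assembly converts them to byte strides with SLWI), and the four accumulators mirror the partial sums rsumr0, rsumi0, isumi0 and isumr0 that are merged at the `combine` label.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: non-conjugated complex dot product over split
 * real/imaginary arrays. Strides I and J are in elements (assumption). */
void cdot_split(const float *Ar, const float *Ai, ptrdiff_t I,
                const float *Br, const float *Bi, ptrdiff_t J,
                float *Cr, float *Ci, ptrdiff_t N)
{
    float rsumr = 0.0f;  /* sum of ar * br */
    float rsumi = 0.0f;  /* sum of ai * bi */
    float isumi = 0.0f;  /* sum of ar * bi */
    float isumr = 0.0f;  /* sum of ai * br */

    assert(N >= 0);
    for (ptrdiff_t n = 0; n < N; n++) {
        float ar = Ar[n * I], ai = Ai[n * I];
        float br = Br[n * J], bi = Bi[n * J];
        rsumr += ar * br;
        rsumi += ai * bi;
        isumi += ar * bi;
        isumr += ai * br;
    }

    /* Same combination as the FSUBS/FADDS pair at the combine label. */
    *Cr = rsumr - rsumi;
    *Ci = isumi + isumr;
}
```

Keeping four independent accumulators, as the listing does, lets the multiply-adds overlap instead of waiting on a single running sum.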



MC Standard Algorithms -- PPC Macro language Version

File Name:    GEN_R_SUMS.MAC

Description:  Multiple small dot product routine for wireless
              group application.

Entry/params: GEN_R_SUMS (X_bf, Corr_bf, Ptov_map, R_sums, Num_phys_users)

Formula:

    num_sums = 0;
    for ( i = 0; i < Num_phys_users; i++ ) {
        for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)Corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)Corr_bf->imag;
                ++Corr_bf;
            }
            *R_sums++ = sum;
            ++num_sums;
        }
        X_bf += N_FINGERS_MAX_SQUARED;
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000906   fpl        Created
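
The formula in this header can be written as a small, compilable C reference, which is convenient for checking the vector code against known inputs. This is a sketch under assumptions: the type names `complex_bf16` and `complex_bf8`, their field layout, the function name `gen_r_sums_ref`, the value 16 for N_FINGERS_MAX_SQUARED, and the returned count are not taken from the listing. It uses plain 32-bit accumulation, whereas the assembly below uses saturating vector instructions.

```c
#include <assert.h>
#include <stdint.h>

#define N_FINGERS_MAX_SQUARED 16   /* assumed: 4 fingers x 4 fingers */

typedef struct { int16_t real, imag; } complex_bf16;  /* assumed layout */
typedef struct { int8_t  real, imag; } complex_bf8;   /* assumed layout */

/* Hypothetical reference: one 16-element dot product per virtual user.
 * Returns the number of sums written to r_sums. */
int gen_r_sums_ref(const complex_bf16 *x, const complex_bf8 *corr,
                   const unsigned char *ptov_map, int32_t *r_sums,
                   int num_phys_users)
{
    int num_sums = 0;

    assert(num_phys_users >= 0);
    for (int i = 0; i < num_phys_users; i++) {
        for (int j = 0; j < (int)ptov_map[i]; j++) {
            int32_t sum = 0;
            for (int k = 0; k < N_FINGERS_MAX_SQUARED; k++) {
                sum += (int32_t)x[k].real * (int32_t)corr->real;
                sum += (int32_t)x[k].imag * (int32_t)corr->imag;
                ++corr;   /* correlations are consumed sequentially */
            }
            *r_sums++ = sum;
            ++num_sums;
        }
        x += N_FINGERS_MAX_SQUARED;   /* next physical user's X block */
    }
    return num_sums;
}
```

Each physical user reuses the same 16-element X block for all of its virtual users, while the correlation pointer advances continuously across users.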






#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 1

#if DO_IO
#define PTOV_BUMP_1   1
#define CORR_BUMP_32  32
#define CORR_BUMP_64  64
#define X_BUMP_64     64
#define RSUM_BUMP_8   8
#define RSUM_BUMP_4   4
#else
#define PTOV_BUMP_1   0
#define CORR_BUMP_32  0
#define CORR_BUMP_64  0
#define X_BUMP_64     0
#define RSUM_BUMP_8   0
#define RSUM_BUMP_4   0
#endif

#define LOAD_CORR( vT, rA, rB )  LVX( vT, rA, rB )

#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif

#define OLOOP_BIT 6

/**
  Input parameters
**/
#define X_bf            r3
#define Corr_bf         r4
#define Ptov_map        r5
#define R_sump          r6
#define Num_phys_users  r7

/**
  Local GPRS
**/
#define icount      r8
#define ptov_count  r9
#define indx1       r10
#define indx2       r11
#define indx3       r12
#define sindex1     r13
#define dstp        r14
#define dst_code    r15

/**
  G4 registers
**/
#define corr00  v0
#define corr01  v1
#define corr10  v2
#define corr11  v3

#define C0_0    v4
#define C1_0    v5
#define C0_8    v6
#define C1_8    v7
#define C0_16   v8
#define C1_16   v9
#define C0_24   v10
#define C1_24   v11

#define X0      v12
#define X8      v13
#define X16     v14
#define X24     v15

#define sum0    v16
#define sum1    v17
#define zero    v18

/**
  Begin code text
**/

FUNC_PROLOG
ENTRY_5( gen_R_sums, X_bf, Corr_bf, Ptov_map, R_sump, Num_phys_users )
CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )
SAVE_r13_r15
USE_THRU_v18( VRSAVE_COND )

/**
  DST setup
**/
MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 1, 0 )
ADDI( dstp, Corr_bf, 80 )               /* start prefetch advanced */
PREFETCH( dstp, dst_code, 0 )

/**
  Setup for outer loop entry
  Read and expand two corr vectors
  Set outer loop counter condition
**/
LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
LVX( corr00, 0, Corr_bf )
VXOR( zero, zero, zero )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
LVX( corr11, Corr_bf, indx3 )
VUPKHSB( C0_0, corr00 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VUPKLSB( C0_8, corr00 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, -RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
/**
  Outer loop for each physical user
**/
LABEL( oloop )
/* { */
DECR( Num_phys_users )
LBZU( ptov_count, Ptov_map, 1 )
BEQ_CR( OLOOP_BIT, ret )
LVX( X0, 0, X_bf )
LVX( X8, X_bf, indx1 )
SRWI_C( icount, ptov_count, 1 )
LVX( X16, X_bf, indx2 )
LVX( X24, X_bf, indx3 )
ADDI( X_bf, X_bf, X_BUMP_64 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
BEQ_MINUS( one_sum )

/**
  Top of sum loop
  Produces two sums each pass
**/
LABEL( iloop )
/* { */
PREFETCH( dstp, dst_code, 0 )
VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum1, C1_0, X0, zero )
LVX( corr00, 0, Corr_bf )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
DECR_C( icount )
VMSUMSHS( sum1, C1_8, X8, sum1 )
LVX( corr11, Corr_bf, indx3 )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum1, R_sump, sindex1 )
/* } */
BNE( iloop )
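
The data flow of one sum in the loop above can be modelled in scalar C. This is a hedged sketch rather than the routine itself: the names `sat32` and `r_sum_model` are hypothetical, and the model assumes the usual AltiVec semantics, namely that VUPKHSB/VUPKLSB sign-extend bytes to halfwords, that VMSUMSHS adds two halfword products plus the accumulator word with signed saturation, and that VSUMSWS adds the four words with saturation.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical scalar model of one R sum: x holds 16 complex 16-bit
 * values (32 halfwords), corr holds 16 complex 8-bit values (32 bytes). */
int32_t r_sum_model(const int16_t x[32], const int8_t corr[32])
{
    int32_t lane[4] = {0, 0, 0, 0};

    for (int v = 0; v < 4; v++) {          /* four VMSUMSHS steps */
        for (int w = 0; w < 4; w++) {      /* four word lanes per vector */
            int k = 8 * v + 2 * w;
            int64_t t = (int64_t)x[k] * corr[k]
                      + (int64_t)x[k + 1] * corr[k + 1]
                      + lane[w];
            lane[w] = sat32(t);
        }
    }
    /* VSUMSWS: saturating sum across the four lanes. */
    return sat32((int64_t)lane[0] + lane[1] + lane[2] + lane[3]);
}
```

Because each lane pairs a real product with an imaginary product, the real and imaginary contributions never need to be separated before the final cross-lane sum.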
/**
  Drop out, check for remainders
**/
ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**
  One more sum:
  Enters and exits with two corr vectors loaded and expanded to 16 bit
**/
LABEL( one_sum )
VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum0, C0_8, X8, sum0 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VSUMSWS( sum0, sum0, zero )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )
ADDI( R_sump, R_sump, -RSUM_BUMP_4 )    /* pre-dec pointer for loop reentry */

/**
  Setup for loop re-entry: corr00 consumed in one_sum section

      loop exit ptr         V
      corr00  corr10  corr00  corr10  corr00  corr10
      loop re-entry ptr
**/
VMR( corr00, corr10 )
LVX( corr10, 0, Corr_bf )
VMR( corr01, corr11 )
LVX( corr11, Corr_bf, indx1 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_32 )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
/* } */
BR( oloop )

/**
  Exit routine
**/
LABEL( ret )
FREE_THRU_v18( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG



MC Standard Algorithms -- PPC Macro language Version

File Name:    GEN_R_SUMS2.MAC

Description:  Multiple small dot product routine for wireless
              group application.

Entry/params: GEN_R_SUMS2 (X_bf, Corr0_bf, Corr1_bf,
                           Ptov_map, R_sums0, R_sums1, Num_phys_users)

Formula:

    num_sums = 0;
    for ( i = 0; i < Num_phys_users; i++ ) {
        for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
            sum0 = 0;
            sum1 = 0;
            for ( k = 0; k < 16; k++ ) {
                sum0 += (BF32)X_bf[k].real * (BF32)Corr0_bf->real;
                sum0 += (BF32)X_bf[k].imag * (BF32)Corr0_bf->imag;
                sum1 += (BF32)X_bf[k].real * (BF32)Corr1_bf->real;
                sum1 += (BF32)X_bf[k].imag * (BF32)Corr1_bf->imag;
                ++Corr0_bf;
                ++Corr1_bf;
            }
            *R_sums0++ = sum0;
            *R_sums1++ = sum1;
            ++num_sums;
        }
        X_bf += N_FINGERS_MAX_SQUARED;
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000906   fpl        Created
0.1        000908   fpl        Fixed zero bug



#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 1

#if DO_IO
#define PTOV_BUMP_1   1
#define CORR_BUMP_32  32
#define CORR_BUMP_64  64
#define X_BUMP_64     64
#define RSUM_BUMP_8   8
#define RSUM_BUMP_4   4
#else
#define PTOV_BUMP_1   0
#define CORR_BUMP_32  0
#define CORR_BUMP_64  0
#define X_BUMP_64     0
#define RSUM_BUMP_8   0
#define RSUM_BUMP_4   0
#endif

#define LOAD_CORR( vT, rA, rB )  LVX( vT, rA, rB )
#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif



#define OLOOP_BIT 6

/**
  Input parameters
**/
#define X_bf            r3
#define Corr0_bf        r4
#define Corr1_bf        r5
#define Ptov_map        r6
#define R_sump0         r7
#define R_sump1         r8
#define Num_phys_users  r9

/**
  Local GPRS
**/
#define icount      r10
#define ptov_count  r11
#define indx1       r12
#define indx2       r13
#define indx3       r14
#define sindex1     r15
#define dstp        r16
#define dst_code    r17
#define dst_stride  indx3

/**
  G4 registers
**/
#define corr00  v0
#define corr01  v1
#define corr10  v2
#define corr11  v3
#define corr20  v4
#define corr21  v5
#define corr30  v6
#define corr31  corr00
#define zero    v7

#define C0_0    v8
#define C1_0    v9
#define C2_0    v10
#define C3_0    v11

#define C0_8    v12
#define C1_8    v13
#define C2_8    v14
#define C3_8    v15

#define C0_16   v16
#define C1_16   v17
#define C2_16   v18
#define C3_16   v19

#define C0_24   v20
#define C1_24   v21
#define C2_24   v22
#define C3_24   v23

#define X0      v24
#define X8      v25
#define X16     v26
#define X24     v27

#define sum0    v28
#define sum1    v29
#define sum2    v30
#define sum3    v31

/**
  Begin code text
**/

FUNC_PROLOG
#if 1
NOP                                     /****** alignment may be important ******/
#endif
ENTRY_7( gen_R_sums2, X_bf, Corr0_bf, Corr1_bf, Ptov_map, R_sump0, R_sump1,
         Num_phys_users )
CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )
SAVE_r13_r17
USE_THRU_v31( VRSAVE_COND )

/**
  DST setup
**/
SUB( dst_stride, Corr1_bf, Corr0_bf )
MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 2, dst_stride )
ADDI( dstp, Corr0_bf, 80 )              /* start prefetch advanced */
/* 48: 1087, 64: 1094, 80: 1043, 96: 1058, 112: 1049, 128: 1061 */
PREFETCH( dstp, dst_code, 0 )

/**
  Setup for outer loop entry
  Read and expand two corr vectors
  Set outer loop counter condition
**/
LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
LOAD_CORR( corr00, 0, Corr0_bf )
LOAD_CORR( corr10, Corr0_bf, indx2 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
LOAD_CORR( corr20, 0, Corr1_bf )
ADDI( R_sump0, R_sump0, -RSUM_BUMP_8 )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
ADDI( R_sump1, R_sump1, -RSUM_BUMP_8 )
LOAD_CORR( corr11, Corr0_bf, indx3 )
VXOR( zero, zero, zero )
LOAD_CORR( corr21, Corr1_bf, indx1 )
VUPKHSB( C0_0, corr00 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
VUPKHSB( C1_0, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKHSB( C3_0, corr30 )
VUPKLSB( C0_8, corr00 )
LOAD_CORR( corr31, Corr1_bf, indx3 )    /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VUPKLSB( C2_8, corr20 )
VUPKLSB( C3_8, corr30 )
VUPKHSB( C0_16, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKHSB( C3_16, corr31 )

VUPKLSB( C0_24, corr01 )
VUPKLSB( C1_24, corr11 )
VUPKLSB( C2_24, corr21 )
VUPKLSB( C3_24, corr31 )

/**
  Outer loop for each physical user
**/
LABEL( oloop )
/* { */
DECR( Num_phys_users )
LBZU( ptov_count, Ptov_map, PTOV_BUMP_1 )
BEQ_CR( OLOOP_BIT, ret )
LVX( X0, 0, X_bf )
LVX( X8, X_bf, indx1 )
SRWI_C( icount, ptov_count, 1 )
LVX( X16, X_bf, indx2 )
LVX( X24, X_bf, indx3 )
ADDI( X_bf, X_bf, X_BUMP_64 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
BEQ_MINUS( one_sum )

/**
  Top of sum loop
  Produces four sums each pass
**/
LABEL( iloop )
/* { */
PREFETCH( dstp, dst_code, 0 )
LOAD_CORR( corr00, 0, Corr0_bf )
DECR_C( icount )
LOAD_CORR( corr10, Corr0_bf, indx2 )
VMSUMSHS( sum0, C0_0, X0, zero )
LOAD_CORR( corr20, 0, Corr1_bf )
VMSUMSHS( sum1, C1_0, X0, zero )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
VMSUMSHS( sum2, C2_0, X0, zero )
LOAD_CORR( corr11, Corr0_bf, indx3 )
LOAD_CORR( corr21, Corr1_bf, indx1 )
VMSUMSHS( sum3, C3_0, X0, zero )
VUPKHSB( C0_0, corr00 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VUPKHSB( C2_0, corr20 )
VMSUMSHS( sum1, C1_8, X8, sum1 )
VUPKHSB( C3_0, corr30 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum3, C3_8, X8, sum3 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
LOAD_CORR( corr31, Corr1_bf, indx3 )    /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VUPKLSB( C2_8, corr20 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKLSB( C3_8, corr30 )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VUPKHSB( C1_16, corr11 )
VMSUMSHS( sum3, C3_16, X16, sum3 )
VUPKHSB( C2_16, corr21 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKHSB( C3_16, corr31 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VUPKLSB( C0_24, corr01 )
VMSUMSHS( sum3, C3_24, X24, sum3 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
VUPKLSB( C2_24, corr21 )
VSUMSWS( sum2, sum2, zero )
VUPKLSB( C3_24, corr31 )
VSPLTW( sum0, sum0, 3 )
VSUMSWS( sum3, sum3, zero )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum1, R_sump0, sindex1 )
VSPLTW( sum3, sum3, 3 )
STVEWX( sum2, 0, R_sump1 )
STVEWX( sum3, R_sump1, sindex1 )
/* } */
BNE( iloop )

/**
  Drop out, check for remainders
**/
ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**
  One more sum:
  Enters and exits with two corr vectors loaded and expanded to 16 bit
**/
LABEL( one_sum )
VMSUMSHS( sum0, C0_0, X0, zero )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VMSUMSHS( sum2, C2_0, X0, zero )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VSUMSWS( sum0, sum0, zero )
VSUMSWS( sum2, sum2, zero )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum2, 0, R_sump1 )
ADDI( R_sump0, R_sump0, -RSUM_BUMP_4 )  /* pre-dec pointers for loop reentry */
ADDI( R_sump1, R_sump1, -RSUM_BUMP_4 )

/**
  Setup for loop re-entry: corr00 consumed in one_sum section

      exit ptr              V
      corr00  corr10  corr00  corr10  corr00  corr10
      re-entry ptr
**/
VMR( corr21, corr31 )                   /* corr00, corr31 same register */
VMR( corr00, corr10 )
LOAD_CORR( corr10, 0, Corr0_bf )
VMR( corr01, corr11 )
LOAD_CORR( corr11, Corr0_bf, indx1 )
VMR( corr20, corr30 )
LOAD_CORR( corr30, 0, Corr1_bf )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
LOAD_CORR( corr31, Corr1_bf, indx1 )    /* corr00, corr31 same register */
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKLSB( C2_8, corr20 )
VUPKHSB( C3_0, corr30 )
VUPKLSB( C3_8, corr30 )
VUPKHSB( C0_16, corr01 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_32 )
VUPKLSB( C0_24, corr01 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_32 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKLSB( C2_24, corr21 )
VUPKHSB( C3_16, corr31 )
VUPKLSB( C3_24, corr31 )
/* } */
BR( oloop )

/**
  Exit routine
**/
LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
REST_r13_r17
RETURN
FUNC_EPILOG



MC Standard Algorithms -- PPC Macro language Version

File Name:    GEN_X_ROW.MAC

Description:  2 complex scalers (4x1) 2 complex vectors (4xN)
              16 bit complex multiplication producing a 16
              bit complex vector of length 16*N.

Entry/params: GEN_X_ROW (A1, A2, C, Phys_index, N)

Formula:

    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * N_FINGERS_MAX);
        in_mpath2p = mpath2_bf + (i * N_FINGERS_MAX);
        j = 0;
        for ( q1 = 0; q1 < N_FINGERS_MAX; q1++ ) {
            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;
            for ( q = 0; q < N_FINGERS_MAX; q++ ) {
                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;
                cr  = (a1r * s1r) + (a1i * s1i);
                ci  = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);
                X_bf[i * N_FINGERS_MAX_SQUARED + j].real
                    = (BF16)(cr >> 16);
                X_bf[i * N_FINGERS_MAX_SQUARED + j].imag
                    = (BF16)(ci >> 16);
                ++j;
            }
        }
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000907   fpl        Created
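
The innermost statement of this formula computes s1 * conj(a1) + s2 * conj(a2) and keeps the upper 16 bits of each 32-bit result. A minimal C sketch of that single element follows; the type name `complex_bf16`, its field layout, and the function name `x_row_elem` are assumptions made for illustration and do not come from the listing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t real, imag; } complex_bf16;  /* assumed layout */

/* Hypothetical helper: one output element of the formula,
 * s1 * conj(a1) + s2 * conj(a2), keeping the upper 16 bits. */
complex_bf16 x_row_elem(complex_bf16 s1, complex_bf16 a1,
                        complex_bf16 s2, complex_bf16 a2)
{
    int32_t cr, ci;
    complex_bf16 out;

    cr  = (int32_t)a1.real * s1.real + (int32_t)a1.imag * s1.imag;
    ci  = (int32_t)a1.real * s1.imag - (int32_t)a1.imag * s1.real;
    cr += (int32_t)a2.real * s2.real + (int32_t)a2.imag * s2.imag;
    ci += (int32_t)a2.real * s2.imag - (int32_t)a2.imag * s2.real;

    /* Assumes an arithmetic right shift for negative values and inputs
     * scaled so that the 32-bit sums do not overflow. */
    out.real = (int16_t)(cr >> 16);
    out.imag = (int16_t)(ci >> 16);
    return out;
}
```

The vector routine below reaches the same result differently: it pre-builds the (sr, si) and (si, -sr) operand patterns once, so each VMSUMSHS yields a complete real or imaginary part per word.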



#include "salppc.inc"

#define LOG_N_FINGERS_MAX 2
#define LOG_ELEMENT_SIZE  2
#define INDEX_SHIFT (LOG_N_FINGERS_MAX + LOG_ELEMENT_SIZE)

/**
  Local read-only Permute vector table
**/
RODATA_SECTION( 6 )
START_L_ARRAY( local_table )
L_PERMUTE_MASK( 0x02031011, 0x06071415, 0x0a0b1819, 0x0e0f1c1d )
/**
  32 -> 16 bit: select the 16 MSBs of each 32 bit field
**/
L_PERMUTE_MASK( 0x00011011, 0x04051415, 0x08091819, 0x0c0d1c1d )
END_ARRAY
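
The effect of the second permute mask can be checked with a byte-wise model of the AltiVec permute operation. The sketch below is illustrative only: `vperm_model`, `put_be32`, `get_be16` and `MSB16_MASK` are hypothetical names, and big-endian byte order within each 32-bit word is assumed, as on the PowerPC G4 target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Second table entry: bytes 0-1 of each word of the first source,
 * followed by bytes 0-1 of the matching word of the second source. */
static const uint8_t MSB16_MASK[16] = {
    0x00, 0x01, 0x10, 0x11, 0x04, 0x05, 0x14, 0x15,
    0x08, 0x09, 0x18, 0x19, 0x0c, 0x0d, 0x1c, 0x1d
};

/* Byte-wise model of vperm: selector values 0-15 pick from a,
 * values 16-31 pick from b. */
void vperm_model(uint8_t d[16], const uint8_t a[16],
                 const uint8_t b[16], const uint8_t c[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++) {
        unsigned sel = c[i] & 0x1f;
        tmp[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    memcpy(d, tmp, sizeof tmp);
}

/* Store a 32-bit word in big-endian byte order. */
void put_be32(uint8_t *p, uint32_t w)
{
    p[0] = (uint8_t)(w >> 24);
    p[1] = (uint8_t)(w >> 16);
    p[2] = (uint8_t)(w >> 8);
    p[3] = (uint8_t)w;
}

/* Read a 16-bit halfword in big-endian byte order. */
uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

With the real sums as the first source and the imaginary sums as the second, the result is four interleaved 16-bit complex values, each holding the upper halves of the corresponding 32-bit words.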

/**
  API registers
**/
#define A1          r3
#define A2          r4
#define C           r5
#define Phys_index  r6
#define N           r7

/**
  Integer loop registers
**/
#define Cp0     C
#define Cp1     r8
#define sptr1   r8
#define Cp2     r9
#define sptr2   r9
#define Cp3     r10
#define tptr    r10
#define cindex  r11
#define aindex  r12
#define index   r12

/**
  G4 registers
**/
#define cr00    v0
#define cr01    v1
#define cr02    v2
#define cr03    v3

#define vtmp0   v0
#define vtmp2   v2

#define ci00    v4
#define ci01    v5
#define ci02    v6
#define ci03    v7

#define sr00    v8
#define sr01    v9
#define sr02    v10
#define sr03    v11
#define si00    v12
#define si01    v13
#define si02    v14
#define si03    v15
#define sr10    v16
#define sr11    v17
#define sr12    v18
#define sr13    v19
#define si10    v20
#define si11    v21
#define si12    v22
#define si13    v23

#define c0      v24
#define c1      v24
#define c2      v25
#define c3      v26

#define a00     v27
#define a10     v27
#define a01     v28
#define a11     v29

#define sval      v28
#define neg_sval  v29

#define vc      v30
#define zero    v31

/**
  Begin code text
**/

FUNC_PROLOG
ENTRY_5( gen_X_row, A1, A2, C, Phys_index, N )
USE_THRU_v31( VRSAVE_COND )

/**
  Load up complex scaler
  sval = sr0 si0 sr1 si1 sr2 si2 sr3 si3
**/
LA( tptr, local_table, 0 )
VXOR( zero, zero, zero )
LI( index, 0 )

/**
  Byte offset into 16 bit complex vector
**/
SLWI( Phys_index, Phys_index, INDEX_SHIFT )
ADD( sptr1, A1, Phys_index )
ADD( sptr2, A2, Phys_index )

/**
  Load up first scaler:
  if sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
          =   s0      s1      s2      s3
**/
LVX( sval, sptr1, index )               /* read 4 16 bit complex values */
VSUBSHS( neg_sval, zero, sval )         /* negate complex scaler values */
VMRGHW( vtmp0, sval, sval )             /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval )             /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr00, vtmp0, vtmp0 )            /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr01, vtmp0, vtmp0 )            /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr02, vtmp2, vtmp2 )            /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr03, vtmp2, vtmp2 )            /* sr3 = s3 s3 s3 s3 */

/**
  if neg_sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
  after perm:
                si0,-sr0 si1,-sr1 si2,-sr2 si3,-sr3
              =   ns0      ns1      ns2      ns3
**/
LVX( vc, tptr, index )
VPERM( neg_sval, sval, neg_sval, vc )   /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval )     /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval )     /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si00, vtmp0, vtmp0 )            /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si01, vtmp0, vtmp0 )            /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si02, vtmp2, vtmp2 )            /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si03, vtmp2, vtmp2 )            /* si3 = ns3 ns3 ns3 ns3 */

/**
  Load up second scaler:
**/
LVX( sval, sptr2, index )               /* read 4 16 bit complex values */
ADDI( index, index, 16 )
VSUBSHS( neg_sval, zero, sval )         /* negate complex scaler values */
VMRGHW( vtmp0, sval, sval )             /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval )             /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr10, vtmp0, vtmp0 )            /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr11, vtmp0, vtmp0 )            /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr12, vtmp2, vtmp2 )            /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr13, vtmp2, vtmp2 )            /* sr3 = s3 s3 s3 s3 */
VPERM( neg_sval, sval, neg_sval, vc )   /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval )     /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval )     /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si10, vtmp0, vtmp0 )            /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si11, vtmp0, vtmp0 )            /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si12, vtmp2, vtmp2 )            /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si13, vtmp2, vtmp2 )            /* si3 = ns3 ns3 ns3 ns3 */



/**
  Assign loop pointers and index registers:
  Loop permute control vector assumes 16 bit input vectors
  C[] -> 16 x N complex elements
  A[] -> 4 x N complex elements
  N   -> 4 byte (i.e. interleaved complex) elements
**/
LVX( vc, tptr, index )                  /* interleaves 16 MSBs of real, imaginary */
LI( aindex, 0 )
LI( cindex, 0 )
ADDI( Cp1, C, 16 )
ADDI( Cp2, C, 32 )
ADDI( Cp3, C, 48 )

/**
  Start up loop code:
  Each read on A[] brings in 4 complex input values
**/
LVX( a00, A1, aindex )
DECR_C( N )
LVX( a01, A2, aindex )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
BEQ( do1 )
DECR_C( N )
LVX( a10, A1, aindex )                  /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
BR( mid_loop0 )

/**
  Top of double loop
**/


LABEL( loop0 )
/* { */
VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
LVX( a10, A1, aindex )                  /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
LABEL( mid_loop0 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )             /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                 /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
/* } */
BNE( loop1 )
/**
  Drop out to flush
**/
VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )             /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                 /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

/**
  Top of second loop
**/
LABEL( loop1 )
/* { */
VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
LVX( a00, A1, aindex )                  /* read input for next pass */
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
LVX( a01, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )             /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                 /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )
/* } */
BNE( loop0 )

/**
  Flush loop
**/
VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )             /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                 /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

LABEL( do1 )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
VMSUMSHS( cr01, sr11, a01, cr01 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )             /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                 /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )

/**
  Return
**/
LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
RETURN
FUNC_EPILOG

#include "mudlib.h"

/*
 * Return the offset in units of complex elements into the Corr0 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr0_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users ( ptov_map, 0, 0,
                                                 start_phys_user,
                                                 start_virt_user ) - 1;
    num_Corrs = (num_virt_users * tot_virt_users) -
                ((num_virt_users * (num_virt_users + 1)) / 2);

    return ( num_Corrs * (num_fingers * num_fingers) );
}
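
The offset arithmetic above is consistent with a strictly upper-triangular block layout: if row r holds tot_virt_users - 1 - r blocks, then the first n rows hold n * tot_virt_users - n(n + 1)/2 blocks, each of num_fingers * num_fingers elements. The sketch below works one example under that reading. It is an assumption-laden illustration: `virt_users_before` is a hypothetical stand-in for `mudlib_get_num_virt_users(ptov_map, 0, 0, p, v) - 1`, whose definition is not shown in this section, and `corr0_offset_sketch` is likewise a made-up name.

```c
#include <assert.h>

/* Hypothetical stand-in: number of virtual users that precede the
 * virtual user (phys, virt) in ptov_map order. */
int virt_users_before(const unsigned char *ptov_map, int phys, int virt)
{
    int n = virt;

    assert(virt < (int)ptov_map[phys]);
    for (int i = 0; i < phys; i++)
        n += ptov_map[i];
    return n;
}

/* Same arithmetic as mudlib_get_Corr0_offset, using the stand-in above. */
int corr0_offset_sketch(const unsigned char *ptov_map, int num_fingers,
                        int tot_virt_users, int phys, int virt)
{
    int n = virt_users_before(ptov_map, phys, virt);
    int num_corrs = n * tot_virt_users - (n * (n + 1)) / 2;

    return num_corrs * num_fingers * num_fingers;
}
```

For a map of {2, 3, 1} (six virtual users) and four fingers, the pair (1, 1) is preceded by three virtual users, whose rows hold 5 + 4 + 3 = 12 blocks, so the offset is 12 * 16 = 192 complex elements.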



Return the size (in bytes) of the portion of the CorrO matrix 
corresponding to a specified starting physical user, virtual 
user (within the starting physical user) pair and an ending physical 
user, virtual user pair, inclusive. Elements of CorrO are assumed 
to be of type C0MPLEX_BF8 . 

/ 

int mudlib get CorrO size ( 
l^js unsigned char *ptovjmap, /* no more than 2 56 virts. per phys */ 



m 



int 


num fingers. 


/* 


int 


tot_virt_users , 


/* 


*/ 






int 


start phys user. 


/* 


int 


start_virt_user. 


/* 


*/ 






int 


end phys user. 


/* 


int 


end_virt_user 


/* 



must be < ptov_map [end__phys__user] */ 

) 

int start_of f set, end_offset; 

start_offset = mudlib_get_CorrO_of f set ( ptov map, 

num fingers, 
tot virt users, 
start phys user, 
start_virt_user ) ; 

MUDLIB_INCR_VIRT_USER ( ptov_map, end_phys_user , end_virt_user ) 

end_offset = mudlib_get_CorrO_of f set ( ptov map, 

num fingers, 
tot virt users, 
end phys user, 
end_virt_user ) ; 

return ( (end_offset - start^of f set) * sizeof (COMPLEX BF8) ); 
/*
 * Return the offset in units of complex elements into the Corr1 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr1_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users);
    return ( num_Corrs * (num_fingers * num_fingers) );
}

/*
 * Return the size (in bytes) of the portion of the Corr1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr1 are assumed
 * to be of type COMPLEX_BF8.
 */
int mudlib_get_Corr1_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr1_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr1_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}
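
The Corr0 and Corr1 offset arithmetic can be checked with a small standalone sketch. It assumes that Corr0 keeps, for each virtual user, only the blocks pairing it with later users (so row k holds tot_virt_users - 1 - k blocks), while Corr1 is a full square. That layout is inferred from the closed-form expression in the listing, not stated by it. The function names below are hypothetical and are not part of mudlib.

```c
#include <assert.h>

/* Closed form as in the listing: blocks skipped before row `skipped`,
 * times num_fingers^2 complex elements per block. */
static int corr0_offset_closed(int num_fingers, int tot_virt_users, int skipped)
{
    int num_corrs = skipped * tot_virt_users - (skipped * (skipped + 1)) / 2;
    return num_corrs * num_fingers * num_fingers;
}

/* Same quantity by summing row lengths, assuming row k holds
 * (tot_virt_users - 1 - k) blocks (strict upper triangle). */
static int corr0_offset_summed(int num_fingers, int tot_virt_users, int skipped)
{
    int k, num_corrs = 0;
    assert(skipped >= 0 && skipped <= tot_virt_users);
    for (k = 0; k < skipped; k++)
        num_corrs += tot_virt_users - 1 - k;
    return num_corrs * num_fingers * num_fingers;
}

/* Corr1 is treated as a full square: each skipped row holds
 * tot_virt_users blocks. */
static int corr1_offset(int num_fingers, int tot_virt_users, int skipped)
{
    return skipped * tot_virt_users * num_fingers * num_fingers;
}
```

With 4 fingers and 8 virtual users, skipping 3 users gives 18 Corr0 blocks (7 + 6 + 5) and 24 Corr1 blocks, each block being 16 complex elements.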

/*
 * Return the offset into the R0 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R0_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_virt_users, offset, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    offset = 0;
    for ( i = 0; i < num_virt_users; i++ )
        offset += (tcols - (i & ~R_MATRIX_ALIGN_MASK));
    return offset;
}

/*
 * Return the size (in bytes) of the portion of the R0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R0 are assumed
 * to be of type BF8.
 */
int mudlib_get_R0_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R0_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R0_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the offset into the R1 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R1_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_virt_users, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    return ( num_virt_users * tcols );
}

/*
 * Return the size (in bytes) of the portion of the R1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R1 are assumed
 * to be of type BF8.
 */
int mudlib_get_R1_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R1_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R1_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the number of virtual users
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive.
 */
int mudlib_get_num_virt_users (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, num_virt_users;

    if ( start_phys_user == end_phys_user )
        return ( end_virt_user - start_virt_user + 1 );
    else {
        num_virt_users = ptov_map[start_phys_user] - start_virt_user;
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            num_virt_users += ptov_map[i];
        num_virt_users += (end_virt_user + 1);
        return ( num_virt_users );
    }
}



/*
 * For a specified starting physical user, virtual user
 * (within the starting physical user) pair and a specified
 * number of virtual users inclusive of the starting pair,
 * return (in separate arguments), the corresponding ending
 * physical user, virtual user pair (inclusive).
 */
void mudlib_get_end_user_pair (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int num_virt_users,         /* number from start (must be > 0) */
    int *end_phys_user,         /* zero-based index into ptov_map */
    int *end_virt_user          /* will be < ptov_map[*end_phys_user] */
)
{
    int i, j;

    for ( i = start_phys_user; ; i++ ) {
        for ( j = start_virt_user; j < ptov_map[i]; j++ )
            if ( --num_virt_users == 0 ) break;
        if ( num_virt_users == 0 ) break;
        start_virt_user = 0;
    }

    *end_phys_user = i;
    *end_virt_user = j;
}
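
The two user-pair helpers above are inverses of one another: counting the virtual users in an inclusive range and then walking that many users forward from the same start should return the same end pair. The sketch below illustrates that round trip. The function names and the extra `num_phys` bound are hypothetical additions for safety and do not appear in the listing.

```c
#include <assert.h>

/* Count virtual users from (sp, sv) to (ep, ev), inclusive. */
static int count_virt_users(const unsigned char *ptov_map,
                            int sp, int sv, int ep, int ev)
{
    int i, n;
    if (sp == ep)
        return ev - sv + 1;
    n = ptov_map[sp] - sv;             /* rest of the first physical user */
    for (i = sp + 1; i < ep; i++)
        n += ptov_map[i];              /* whole physical users in between */
    return n + ev + 1;                 /* leading part of the last one */
}

/* Walk n users forward from (sp, sv) and report the last one visited.
 * num_phys (assumed parameter) bounds the walk over ptov_map. */
static void end_user_pair(const unsigned char *ptov_map, int num_phys,
                          int sp, int sv, int n, int *ep, int *ev)
{
    int i, j;
    assert(n > 0);
    for (i = sp; i < num_phys; i++) {
        for (j = sv; j < ptov_map[i]; j++) {
            if (--n == 0) {
                *ep = i;
                *ev = j;
                return;
            }
        }
        sv = 0;                        /* later physical users start at 0 */
    }
    assert(!"ran past the end of ptov_map");
}
```

For a map of {2, 3, 1} virtual users per physical user, the range from (0, 1) to (2, 0) contains 5 virtual users, and walking 5 users forward from (0, 1) ends at (2, 0).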
#include "mudlib.h"

/************************************************
 * Virtual users version
 ************************************************/

int mudlib_get_Corr0_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, remaining_size, skipped_virt_users,
        total_size;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int) ptov_map[i];

    skipped_virt_users += start_virt_user;

    // Always even
    total_size = tot_virt_users * ( tot_virt_users - 1 );
    remaining_size = ( tot_virt_users - skipped_virt_users )
                   * ( tot_virt_users - skipped_virt_users - 1 );

    // zero based units of complex elements
    return ( num_fingers_squared * ( ( total_size - remaining_size ) >> 1 ) );
}

int mudlib_get_Corr1_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, skipped_virt_users;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int) ptov_map[i];

    skipped_virt_users += start_virt_user;

    return ( num_fingers_squared * ( skipped_virt_users * tot_virt_users ) );
}
int mudlib_get_R0_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_skipped_virt_users = 0;
    size = 0;

    for ( i = 0; i < start_phys_user; i++ ) {
        for ( iv = 0; iv < (int) ptov_map[i]; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            ++R0_skipped_virt_users;
        }
    }

    /* Handle last physical user, potentially split on virt users */
    for ( iv = 0; iv < (int) start_virt_user; iv++ ) {
        R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
        size += R0_tcols;
        ++R0_skipped_virt_users;
    }

    return size;
}

int mudlib_get_R0_size_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_skipped_virt_users = 0;
    for ( i = 0; i < start_phys_user; i++ )
        R0_skipped_virt_users += (int) ptov_map[i];
    R0_skipped_virt_users += start_virt_user;
    // printf ("skipped: %d\n", R0_skipped_virt_users);

    size = 0;

    if ( start_phys_user == end_phys_user ) {
        // printf ("start == end phys\n");
        //                               <= for inclusive
        for ( iv = start_virt_user; iv <= (int) end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf ("size: %d, R0tc: %d\n", size, R0_tcols);
            ++R0_skipped_virt_users;
        }
    }
    else {
        for ( i = start_phys_user; i < end_phys_user; i++ ) {
            /* the first physical user starts at start_virt_user, the rest at 0 */
            for ( iv = (i == start_phys_user) ? start_virt_user : 0;
                  iv < (int) ptov_map[i]; iv++ ) {
                R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
                size += R0_tcols;
                // printf ("size: %d, R0tc: %d\n", size, R0_tcols);
                ++R0_skipped_virt_users;
            }
        }

        /* Handle last physical user, potentially split on virt users */
        // printf ("last phys user\n");
        //                 <= for inclusive
        for ( iv = 0; iv <= (int) end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf ("size: %d, R0tc: %d\n", size, R0_tcols);
            ++R0_skipped_virt_users;
        }
    }

    return size;
}

int mudlib_get_R1_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    // Main loop
    for ( i = 0; i < start_phys_user; i++ ) {
        virt_users += (int) ptov_map[i];
    }

    // Trailing virtual users
    virt_users += start_virt_user;

    return ( virt_users * tcols );
}
int mudlib_get_R1_size_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    if ( start_phys_user == end_phys_user )
        virt_users = end_virt_user - start_virt_user + 1;
    else if ( start_phys_user < end_phys_user ) {
        // Leading virtual users
        virt_users = (int) ptov_map[start_phys_user] - start_virt_user;

        // Main loop
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            virt_users += (int) ptov_map[i];

        // Trailing virtual users
        virt_users += (end_virt_user + 1);
    }

    return ( virt_users * tcols );
}
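
The R0 and R1 offset routines both pad the row width up to a multiple of the alignment, and R0 additionally shortens each row by the aligned part of its row index. A minimal sketch of that arithmetic follows. `ALIGN_MASK` mirrors `R_MATRIX_ALIGN_MASK` from mudlib.h (31), the function names are hypothetical, and the reading that each R0 row begins at the aligned column at or below its diagonal is an interpretation of the expression in the listing rather than something the listing states.

```c
#include <assert.h>

#define ALIGN_MASK 31   /* mirrors R_MATRIX_ALIGN_MASK = (1 << 5) - 1 */

/* Row width rounded up to the next multiple of 32 columns. */
static int padded_cols(int tot_virt_users)
{
    return (tot_virt_users + ALIGN_MASK) & ~ALIGN_MASK;
}

/* Elements stored for one R0 row: the padded width minus the aligned
 * part of the row index. */
static int r0_row_len(int tot_virt_users, int row)
{
    assert(row >= 0);
    return padded_cols(tot_virt_users) - (row & ~ALIGN_MASK);
}

/* Offset of row `skipped` in R0: sum of the lengths of earlier rows. */
static int r0_offset(int tot_virt_users, int skipped)
{
    int i, off = 0;
    for (i = 0; i < skipped; i++)
        off += r0_row_len(tot_virt_users, i);
    return off;
}

/* R1 rows all have the full padded width. */
static int r1_offset(int tot_virt_users, int skipped)
{
    return skipped * padded_cols(tot_virt_users);
}
```

With 40 virtual users the padded width is 64. Rows 0 through 31 of R0 hold 64 elements each and row 32 holds 32, so the offset of row 33 is 32 * 64 + 32 = 2080 elements.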
#define IO 1
#define TIME 0

//
// Asynchronous MPIC
//

#if TIME
#include <tmr.h>
#endif

#include "mudlib.h"

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n );

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

#if TIME
static int time_count = 0;
static int z;
static float time;
static TMR_ts time0, time1;
static TMR_timespec elapsed;
#endif

/*
 * void async_multirate_mpic( BF8 *Bt_hat, BF8 *R0_hat,
 *                            BF8 *R1_hat, BF8 *R1m_hat,
 *                            BF32 *Y, BF32 Ythresh,
 *                            int N_users, int N_bits, int N_stages )
 * N_users must be > 0 and divisible by 4
 * N_bits must be >= 5
 */
void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages )
{
    BF8 *Bt_hatp;
    BF8 *R0_hatp, *R1_hatp, *R1m_hatp;
    BF32 *Yp;
    BF32 R_bias, sums[3];
    int hat_tc, i, m, N_users_pad, stage;

    hat_tc = (N_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    N_users_pad = (N_users + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

#if 0
    if ( ( (long)Bt_hat | (long)R0_upper_bf | (long)R0_lower_bf |
           (long)R1_trans_bf | (long)R1m_bf ) & ALTIVEC_ALIGN_MASK ) {
        printf( "***** inputs are NON-ALIGNED *****\n" );
        exit( -1 );
    }
#endif
    //
    // Subtract interference in N_stages
    //
    for ( stage = 0; stage < N_stages; stage++ ) {

        R0_hatp = R0_hat;
        R1_hatp = R1_hat;
        R1m_hatp = R1m_hat;
        Yp = Y;

        for ( i = 0; i < N_users; i++ ) {

            sve3_8bit( R0_hatp, R1_hatp, R1m_hatp, &R_bias, N_users_pad );
#if 0
            R0_hatp[i] = BF8_ZERO;          /* zero diagonal element */
#endif
            Bt_hatp = Bt_hat + hat_tc;      /* points to leading row */
            m = 2;

            while ( m < (N_bits-4) ) {
                if ( BFABS( Yp[m] ) < Ythresh ) {
                    if ( BFABS( Yp[m+1] ) < Ythresh ) {
                        if ( BFABS( Yp[m+2] ) < Ythresh ) {
                            dotpr9_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ((BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i]);
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ((BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i]);
                            sums[1] -= R_bias;
                            sums[2] -= ((BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i]);
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT;
                            sums[2] += ((BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i]);
                            sums[2] -= R_bias;
                            if ( (Yp[m+2] - sums[2]) > BF32_ZERO )
                                Bt_hatp[3*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[3*hat_tc + i] = -1 + BIAS_8BIT;
                        }
                        else {                      /* skip third sum */
                            dotpr6_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ((BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i]);
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ((BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i]);
                            sums[1] -= R_bias;
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT;
                        }
                        Bt_hatp += hat_tc;          /* bump leading row pointer */
                        ++m;                        /* bump row */
                    }
                    else {                          /* skip second sum */
                        dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                     sums, N_users_pad, hat_tc );
                        sums[0] -= R_bias;
                        if ( (Yp[m] - sums[0]) > BF32_ZERO )
                            Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                        else
                            Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                    }
#if IO
                    Bt_hatp += hat_tc;              /* bump leading row pointer */
#endif
                    ++m;                            /* bump row */
                }
#if IO
                Bt_hatp += hat_tc;                  /* bump leading row pointer */
#endif
                ++m;                                /* bump row */
            }

            /*
             * do last 0, 1 or 2 dot product calculations
             */
            while ( m < (N_bits-2) ) {
                if ( BFABS( Yp[m] ) < Ythresh ) {
                    dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                 sums, N_users_pad, hat_tc );
                    sums[0] -= R_bias;
                    if ( (Yp[m] - sums[0]) > BF32_ZERO )
                        Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                    else
                        Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                }
#if IO
                Bt_hatp += hat_tc;                  /* bump leading row pointer */
#endif
                ++m;
            }

#if IO
            R0_hatp += hat_tc;                      /* bump pointer */
            R1_hatp += hat_tc;                      /* bump pointer */
            R1m_hatp += hat_tc;                     /* bump pointer */
            Yp += N_bits;                           /* bump pointer */
#endif
        }   /* end of loop over N_users */
    }       /* end of loop over N_stages */
}

#if defined( COMPILE_C )

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int j;

    sums[0] = BF32_ZERO;
    for ( j = 0; j < N; j++ ) {
        sums[0] += (BF32)A[j] * (BF32)B0[j];
        sums[0] += (BF32)A[tcols+j] * (BF32)B1[j];
        sums[0] += (BF32)A[(tcols<<1)+j] * (BF32)B2[j];
    }
}

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 2; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 3; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n )
{
    int i;
    BF32 wsum;

    wsum = 0;
    for ( i = 0; i < n; i++ ) {
        wsum += (BF32)A[i];
        wsum += (BF32)B[i];
        wsum += (BF32)C[i];
    }
    *sum = wsum;
}

#endif
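
The cancellation loop in mudlib_mpic can be easier to follow without the biased 8-bit arithmetic and the three-way unrolling. The sketch below restates one stage in plain floating point, as a reading of the listing rather than a substitute for it: for every user and every interior bit whose matched-filter output is below the threshold, the interference estimated from the previous, current and next bit rows is subtracted and a new hard decision is taken. The function name, the float types and the dense row-major layouts are assumptions. The R0 diagonal is assumed to be zero already, and the bias correction (R_bias) is omitted because unbiased values are used.

```c
#include <assert.h>

/* One hard-decision cancellation stage (floating-point sketch).
 *   b            : n_bits x n_users bit estimates, each +1 or -1
 *   r1, r0, r1m  : n_users x n_users correlations applied to the
 *                  previous, current and next bit row respectively
 *   y            : n_users x n_bits matched-filter outputs
 */
static void mpic_stage(float *b, const float *r1, const float *r0,
                       const float *r1m, const float *y, float y_thresh,
                       int n_users, int n_bits)
{
    int i, j, m;
    assert(n_users > 0 && n_bits >= 5);
    for (i = 0; i < n_users; i++) {
        for (m = 2; m < n_bits - 2; m++) {
            float yim = y[i * n_bits + m];
            float mag = yim < 0.0f ? -yim : yim;
            float mai = 0.0f;          /* estimated interference */
            if (mag >= y_thresh)
                continue;              /* confident decision, leave it */
            for (j = 0; j < n_users; j++) {
                mai += r1[i * n_users + j]  * b[(m - 1) * n_users + j];
                mai += r0[i * n_users + j]  * b[m * n_users + j];
                mai += r1m[i * n_users + j] * b[(m + 1) * n_users + j];
            }
            b[m * n_users + i] = (yim - mai) > 0.0f ? 1.0f : -1.0f;
        }
    }
}
```

Calling this function N_stages times in a row corresponds to the outer stage loop of the listing. Because each decision is written back immediately, later users and bits within the same stage already see the updated estimates, as they do in the listing.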
/**
    MC Standard Algorithms PPC Macro language Version

    File Name:      GEN_R_MATRICES.MAC

    Description:    Float and scale R matrix values, convert to byte.

    Entry/params:   GEN_R_MATRICES( Rsump, Bf_scalep, Inv_scalep,
                                    Scalep, No_scale_row_bfp,
                                    Scale_row_bfp, Num_virt_users )

    Formula:
        bf_scale = *bf_scalep;
        inv_scale = *inv_scalep;

        for ( i = 0; i < num_virt_users; i++ ) {
            scale = scalep[i];
            fsum = (float)(R_sums[i]);
            fsum *= bf_scale;

            fsum_scale = fsum * inv_scale;
            fsum_scale *= scale;

            SATURATE( fsum_scale )
            SATURATE( fsum )

            no_scale_row_bfp[i] = BF8_FIX( fsum );
            scale_row_bfp[i] = BF8_FIX( fsum_scale );
        }

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date      Engineer    Reason
    0.0         000910    fpl         Created
    0.1         000914    fpl         Removed VMAXFP and added windin code
    0.3         000920    fpl         Removed all windin and windout
**/


#include "salppc.inc"
#define DO_IO 1
#if DO_IO
#define SCALE_BUMP_16 16
#else
#define SCALE_BUMP_16 0
#endif

#define STORE_SCALE( vS, rA, rB ) STVX( vS, rA, rB )
#define ZERO_COND 6
RODATA_SECTION( 6 )
START_L_ARRAY( local_table )
/**
    First stage for byte pack
**/
L_PERMUTE_MASK( 0x0004080c, 0x1014181c, 0x0004080c, 0x1014181c )
/**
    Second stage for byte pack
**/
L_PERMUTE_MASK( 0x00010203, 0x04050607, 0x10111213, 0x14151617 )
END_ARRAY

/**
    Input parameters
**/
#define Rsump               r3
#define Bf_scalep           r4
#define Inv_scalep          r5
#define Scalep              r6
#define No_scale_row_bfp    r7
#define Scale_row_bfp       r8
#define Num_virt_users      r9

/**
    Local GPRs
**/
#define indx1               r10
#define indx2               r11
#define indx3               r12
#define low4                r0
#define tptr                indx2
#define low4x4              low4

/**
    G4 registers
**/
#define zero                v0
#define inv_scale           v1
#define bf_scale            v2
#define byte_pack           v3
#define byte_merge          v4
#define scale0              v5
#define scale1              v6
#define vtmp                scale1
#define scale2              v7
#define vtmp2               scale2
#define scale3              v8

#define fsum0               v9
#define fsum1               v10
#define fsum2               v11
#define fsum3               v12

#define fsum_scale0         v13
#define fsum_scale1         v14
#define fsum_scale2         v15
#define fsum_scale3         v16

#define bsum0               v17
#define bsum1               v18
#define bsum2               v19
#define bsum3               v20

#define bsum_scale0         v21
#define bsum_scale1         v22
#define bsum_scale2         v23
#define bsum_scale3         v24

#define bvector             v25
#define bscale_vector       v26
#define rsum0               v27
#define rsum1               v28
#define rsum2               v29
#define rsum3               v30
#define seven               v31

/**
    Begin code text
**/
FUNC_PROLOG
ENTRY_7( gen_R_matrices, Rsump, Bf_scalep, Inv_scalep, Scalep, \
         No_scale_row_bfp, Scale_row_bfp, Num_virt_users )

    CMPWI( Num_virt_users, 0 )
    BGT( start )
    RETURN

LABEL( start )
    USE_THRU_v31( VRSAVE_COND )
/**
    Load up permute vectors and loop scalers
**/
    LA( tptr, local_table, 0 )
    LI( indx1, 16 )
    LVX( byte_pack, 0, tptr )
    VSPLTISB( seven, 7 )
    LVX( byte_merge, tptr, indx1 )
    SCALAR_SPLAT( bf_scale, vtmp, Bf_scalep )
    SCALAR_SPLAT( inv_scale, vtmp, Inv_scalep )
/**
    Back up to nearest 16-byte boundary. It's okay to write before and after to
    nearest 16-byte boundary in both directions.
**/
    RLWINM( low4, No_scale_row_bfp, 0, 28, 31 )    /* lower 4 bits */
    VXOR( zero, zero, zero )
    ADD( Num_virt_users, Num_virt_users, low4 )
    SUB( No_scale_row_bfp, No_scale_row_bfp, low4 )
    SUB( Scale_row_bfp, Scale_row_bfp, low4 )
    SLWI( low4x4, low4, 2 )
    LI( indx2, 32 )
    SUB( Rsump, Rsump, low4x4 )
/**
    Start up loop
**/
    LVX( rsum0, 0, Rsump )
    LI( indx3, 48 )
    LVX( rsum1, Rsump, indx1 )
    SUB( Scalep, Scalep, low4x4 )
    LVX( rsum2, Rsump, indx2 )
    VCFSX( fsum0, rsum0, 0 )
    LVX( rsum3, Rsump, indx3 )
    VCFSX( fsum1, rsum1, 0 )
    LVX( scale0, 0, Scalep )
    VCFSX( fsum2, rsum2, 0 )
    LVX( scale1, Scalep, indx1 )
    VCFSX( fsum3, rsum3, 0 )
    LVX( scale2, Scalep, indx2 )
    VMADDFP( fsum0, fsum0, bf_scale, zero )
    LVX( scale3, Scalep, indx3 )
    VMADDFP( fsum1, fsum1, bf_scale, zero )
    ADDIC_C( Num_virt_users, Num_virt_users, -16 )
    VMADDFP( fsum2, fsum2, bf_scale, zero )
    VMADDFP( fsum3, fsum3, bf_scale, zero )
    VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
    VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
    VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
    ADDI( Rsump, Rsump, 64 )
    VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
    ADDI( Scalep, Scalep, 64 )
    VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
    VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
    VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
    VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
    BLE( sixteen_sums )

    LVX( rsum0, 0, Rsump )
    LVX( rsum1, Rsump, indx1 )
    VCTSXS( bsum0, fsum0, 24 )
    LVX( rsum2, Rsump, indx2 )
    VCTSXS( bsum1, fsum1, 24 )
    VCTSXS( bsum2, fsum2, 24 )
    LVX( rsum3, Rsump, indx3 )
    ADDI( Rsump, Rsump, 64 )
    VCTSXS( bsum3, fsum3, 24 )
    LVX( scale0, 0, Scalep )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    LVX( scale1, Scalep, indx1 )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
    LVX( scale2, Scalep, indx2 )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, -SCALE_BUMP_16 )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    ADDI( Scale_row_bfp, Scale_row_bfp, -SCALE_BUMP_16 )
    BR( mloop )

/**
    Top of loop outputs 32 bytes per trip
**/
LABEL( loop )
/* { */
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )

LABEL( mloop )
    LVX( scale3, Scalep, indx3 )
    VCFSX( fsum0, rsum0, 0 )
    VPERM( bsum0, bsum0, bsum1, byte_pack )
    VCFSX( fsum1, rsum1, 0 )
    VCFSX( fsum2, rsum2, 0 )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
    VCFSX( fsum3, rsum3, 0 )
    ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )
    VMADDFP( fsum0, fsum0, bf_scale, zero )
    VPERM( bsum2, bsum2, bsum3, byte_pack )
    VMADDFP( fsum1, fsum1, bf_scale, zero )
    VMADDFP( fsum2, fsum2, bf_scale, zero )
    VMADDFP( fsum3, fsum3, bf_scale, zero )
    VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
    VPERM( bvector, bsum0, bsum2, byte_merge )
    VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
    ADDIC_C( Num_virt_users, Num_virt_users, -16 )
    VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
    VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
    ADDI( Scalep, Scalep, 64 )
    VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
    VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
    VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
    VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
    VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
    VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
    VSRB( vtmp, bvector, seven )
    VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
    VSRB( vtmp2, bscale_vector, seven )
    BLE( loop_flush )

    LVX( rsum0, 0, Rsump )
    VADDSBS( bvector, bvector, vtmp )
    LVX( rsum1, Rsump, indx1 )
    VADDSBS( bscale_vector, bscale_vector, vtmp2 )
    LVX( rsum2, Rsump, indx2 )
    VCTSXS( bsum0, fsum0, 24 )
    LVX( rsum3, Rsump, indx3 )
    VCTSXS( bsum1, fsum1, 24 )
    ADDI( Rsump, Rsump, 64 )
    VCTSXS( bsum2, fsum2, 24 )
    LVX( scale0, 0, Scalep )
    VCTSXS( bsum3, fsum3, 24 )
    LVX( scale1, Scalep, indx1 )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    LVX( scale2, Scalep, indx2 )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
/* } */
    BR( loop )

/**
    Flush loop
**/
LABEL( loop_flush )
    VADDSBS( bvector, bvector, vtmp )
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VADDSBS( bscale_vector, bscale_vector, vtmp2 )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
    ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )

LABEL( sixteen_sums )
    VCTSXS( bsum0, fsum0, 24 )
    VCTSXS( bsum1, fsum1, 24 )
    VCTSXS( bsum2, fsum2, 24 )
    VCTSXS( bsum3, fsum3, 24 )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VPERM( bsum0, bsum0, bsum1, byte_pack )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    VPERM( bsum2, bsum2, bsum3, byte_pack )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
    VPERM( bvector, bsum0, bsum2, byte_merge )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
    VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
    VSRB( vtmp, bvector, seven )
    VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
    VADDSBS( bvector, bvector, vtmp )
    VSRB( vtmp, bscale_vector, seven )
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VADDSBS( bscale_vector, bscale_vector, vtmp )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
/**
    Return
**/
LABEL( ret )
    FREE_THRU_v31( VRSAVE_COND )
    RETURN
FUNC_EPILOG
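
The Formula in the header of GEN_R_MATRICES.MAC can be written out as a scalar C reference, which is handy for checking the vector routine on small inputs. This is a sketch under stated assumptions: `saturate_to_i8` is a hypothetical stand-in for the SATURATE and BF8_FIX steps (it clamps to the signed-byte range and truncates toward zero), and the exact rounding of the AltiVec code may differ from it. The function and parameter names are likewise illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for SATURATE + BF8_FIX: clamp, then truncate. */
static int8_t saturate_to_i8(float x)
{
    if (x > 127.0f)
        return 127;
    if (x < -128.0f)
        return -128;
    return (int8_t)x;
}

/* Scalar form of the header formula: float each R sum, apply the block
 * scale, and emit both an unscaled and a per-user scaled byte row. */
static void gen_r_rows(const int32_t *r_sums, float bf_scale, float inv_scale,
                       const float *scale, int8_t *no_scale_row,
                       int8_t *scale_row, int num_virt_users)
{
    int i;
    assert(num_virt_users >= 0);
    for (i = 0; i < num_virt_users; i++) {
        float fsum = (float)r_sums[i] * bf_scale;
        float fsum_scale = fsum * inv_scale;
        fsum_scale *= scale[i];
        no_scale_row[i] = saturate_to_i8(fsum);
        scale_row[i] = saturate_to_i8(fsum_scale);
    }
}
```

Unlike the macro routine, this reference neither backs the output pointers up to a 16-byte boundary nor processes sixteen sums per trip. It only reproduces the per-element arithmetic.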
--***************************************************
--**
--** Majority Voter Control Logic
--**
--** Description: This Module serves as a generic majority voter
--**
--**
--** Author : Steven Imperiali/Mirza Cifric
--** Date : 5-15-2000
--**
--**
--***************************************************

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE STD.TEXTIO.ALL;


ENTITY m_voter IS
PORT(
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END m_voter;

ARCHITECTURE voter OF m voter IS 

signal pro: STD_LOGIC VECTOR (3 downto 0); 

signal against: STD_LOGIC_VECTOR{3 downto 0) ; 

signal result: STD_LOGIC; 

BEGIN 



check result :process (request 0_0 , requestl_0 , request2_0 , request3_0 , request4_0 , h 
ealthyO 1, 

healthyl_l,healthy2 l,healthy3 l,healthy4 1) 
variable pro: STD_LOGIC VECTOR (3 downto 0); 
variable against: STD LOGIC VECTOR (3 downto 0); 
variable solution: STD_LOGIC; 
begin 

pro:= "GOOD"; ~- set number of pro voters 

against :=" GOOD"; 
set number of against voters-- Get the number of pros 
if (healthyO_l = '1' and request 0__0= ' 1 ' ) then 
pro := pro + "0001"; 
end i f ; 

if (healthyl 1=^1' and requestl_0= • 1 ' ) then 
pro := pro + "0001"; 
end if; 

if (healthy2_l='l» and request2_0= ' 1 ' ) then 
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pro := pro + "0001"; 

end if; 

if {healthy3 1= ' 1 ' and request3_0= ' 1 ' ) then 
pro : = pro + "0001" ; 
end i f ; 

if {healthy4 1= ' 1 ' and request4_0= ' 1 ' ) then 
pro := pro + "0001"; 
end if; 
Get the number of cons 

if (healthyO 1 = »!' and request0_0= • 0 ' ) then 
against z- against + "0001"; 
end if; 

if (healthyl 1 = '1' and requestl_0 ='0») then 
against := against + "0001"; 
end if; 

if (healthy2 1 = »1' and request2_0 ='0») then 
against := against + "0001"; 
end if; 

if (healthyB 1 ='1' and request3_0 =*0') then 
against := against + "0001"; 
end if; 

if (healthy4 1 =»1' and request4_0 ='0') then 
against := against + "0001"; 
end if; 
final score 



m 



if {pro = "0001" and against < "0001") then 
solution := '1' 

"0010" and against < "0010") then 



and against < "0011") then 
and against < "0011") then 



elsif {pro 
solution := '1' 
elsif {pro = "0011" 
Cfl solution := '1' 

elsif (pro = "0100" 
solution := '1' 
elsif (pro = "0101" and against < "0011") then 
s solution := '1 

else solution := '0'; 

%U end if; 

result <- solution; . put variable val into 

signal val 

voteout_0 <= solution; -- put variable val into 



signal val 
end process Gheck_result ; 



result_latch: process (reset_0, clk_66_pal6)
begin
  IF (reset_0 = '0') THEN
    voteout_0 <= '1';
  ELSIF rising_edge (clk_66_pal6) THEN
    IF result = '0' THEN
      voteout_0 <= '0';
    END IF;
  END IF;
END PROCESS;

END voter;
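
The vote-counting rule in the check_result process above can be sketched in C as follows. This is only an illustration of the thresholds in the if/elsif chain, not part of the listing; the function name `voter_decide` and the bit-mask encoding of the five healthy and request inputs are assumptions made for the example.

```c
/* Hypothetical C model of the voter: bit i of each mask stands for
 * healthy<i>_1 and request<i>_0 respectively (i = 0..4). */
int voter_decide(unsigned healthy, unsigned request)
{
    int pro = 0, against = 0, i;

    for (i = 0; i < 5; i++) {
        unsigned bit = 1u << i;
        if (healthy & bit) {          /* only healthy boards get a vote */
            if (request & bit)
                pro++;
            else
                against++;
        }
    }

    /* Same thresholds as the if/elsif chain in the VHDL process. */
    switch (pro) {
    case 1:  return against < 1;
    case 2:  return against < 2;
    case 3:
    case 4:
    case 5:  return against < 3;
    default: return 0;                /* no pro votes: request fails */
    }
}
```

With all five boards healthy, three requests carry the vote against two objections, while two requests against three objections do not.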
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* FILENAME: mudlib.h 
* 

* CC NUMBER: 
* 

* ABSTRACT: 
* 

* USAGE: 
* 

* COMMENTS: 
* 

* AUTHOR: M. Vinskus 
* 

* DATE: 18-JUL-2000 

/* ©MERCURY. CO PYRIGHT.H® */
#ifndef _MUDLIB_H
#define _MUDLIB_H

/****************************************************************************
 * INCLUDE FILES
 ****************************************************************************/

#include <sal.h>

/****************************************************************************
 * DEFINED CONSTANTS
 ****************************************************************************/

#define NUM_FINGERS_LOG 2
#define NUM_FINGERS_SQUARED_LOG (2 * NUM_FINGERS_LOG)
#define NUM_FINGERS (1 << NUM_FINGERS_LOG)
#define NUM_FINGERS_SQUARED (1 << NUM_FINGERS_SQUARED_LOG)

#define L1_CACHE_SIZE 32768
#define L1_CACHE_LINE_SIZE 32
#define L1_CACHE_ALIGN_LOG 5
#define L1_CACHE_ALIGN (1 << L1_CACHE_ALIGN_LOG)
#define L1_CACHE_ALIGN_MASK (L1_CACHE_ALIGN - 1)

#define R_MATRIX_ALIGN_LOG 5
#define R_MATRIX_ALIGN (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG 4
#define ALTIVEC_ALIGN (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK (ALTIVEC_ALIGN - 1)

#define BF_CORR_FRAC_BITS 8
#define BF_CORR_FACTOR ((float)(1 << BF_CORR_FRAC_BITS))

#define BF_MPATH_FRAC_BITS 15 /* this should be dynamic */
#define BF_MPATH_FACTOR ((float)(1 << BF_MPATH_FRAC_BITS))

#define BF_RSUMS_FRAC_BITS ((2 * BF_MPATH_FRAC_BITS) - 16 + BF_CORR_FRAC_BITS)
#define BF_RSUMS_FACTOR ((float)(1 << BF_RSUMS_FRAC_BITS))
#define BF_RSUMS_RFACTOR (1.0 / BF_RSUMS_FACTOR)

#define BF_RY_FRAC_BITS 9 /* 0 <= BF_RY_FRAC_BITS <= 14 */
#define BF_RY_FACTOR ((float)(1 << BF_RY_FRAC_BITS))
#define BF_RY_RFACTOR (1.0 / BF_RY_FACTOR)

#define BF_COMBINED_FACTOR ((float)(1 << (BF_RSUMS_FRAC_BITS - BF_RY_FRAC_BITS)))
#define BF_COMBINED_RFACTOR (1.0 / BF_COMBINED_FACTOR)

#define BF8_ZERO 0
#define BF8_MAX 0x7f
#define BF8_RY_ONE ((BF8)(1 << BF_RY_FRAC_BITS))

#define BF16_RY_ONE ((BF16)(1 << BF_RY_FRAC_BITS))
#define BF16_RY_MONE (-BF16_RY_ONE)
#define BF16_ZERO 0
#define BF16_MAX 0x7fff

#define BF32_ZERO 0
#define BF32_RY_ONE ((BF32)(1 << BF_RY_FRAC_BITS))
#define BF32_MAX 0x7fffffff

#define BIAS_8BIT 1

#define BFABS( x ) (((x) >= 0) ? (x) : (-(x)))
#define FABS( f ) (((f) >= 0.0) ? (f) : (-(f)))

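/****************************************************************************
 * TYPE DEFINITIONS
 ****************************************************************************/

typedef long BF32;
typedef short BF16;
typedef char BF8;

typedef struct {
    BF8 real;
    BF8 imag;
} COMPLEX_BF8;

typedef struct {
    BF16 real;
    BF16 imag;
} COMPLEX_BF16;

typedef struct {
    BF32 real;
    BF32 imag;
} COMPLEX_BF32;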

/****************************************************************************
 * MACRO DEFINITIONS
 ****************************************************************************/

/* assumes (-(2.0 ^ 7) - 0.5) < (bf_factor * s) < ((2.0 ^ 7) - 0.5) */

#define SFtoBF8( bf_factor, s ) \
    ((BF8)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF8( bf_factor, v, bfv, n ) \
{ \
    int i; \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, v, 1, n, 0 ); \
    for ( i = 0; i < n; i++ ) \
        bfv[i] = (v[i] > 0.0) ? (BF8)(v[i] + 0.5) : (BF8)(v[i] - 0.5); \
}

#define SBF8toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF8toF( bf_rfactor, bfv, v, n ) \
{ \
    int i; \
    float rfactor = bf_rfactor; \
    for ( i = 0; i < n; i++ ) \
        v[i] = (float)bfv[i]; \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 15) - 0.5) < (bf_factor * s) < ((2.0 ^ 15) - 0.5) */

#define SFtoBF16( bf_factor, s ) \
    ((BF16)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF16( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( (float *)v, 1, &factor, (float *)v, 1, n, 0 ); \
    vfixrx( (float *)v, 1, (BF16 *)bfv, 1, n, 0 ); \
}

#define SBF16toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF16toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vfltx( (short *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 31) - 0.5) < (bf_factor * x) < ((2.0 ^ 31) - 0.5) */

#define SFtoBF32( bf_factor, s ) \
    ((BF32)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF32( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, (float *)bfv, 1, n, 0 ); \
    vfixr32x( (float *)bfv, 1, (int *)bfv, 1, n, 0 ); \
}

#define SBF32toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF32toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vflt32x( (int *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}
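
The scalar conversion macros above scale a float by a power-of-two factor and round half away from zero before truncating to the block-floating type. A minimal stand-alone sketch of that round trip is shown below. The names `bf8_t`, `sf_to_bf8` and `bf8_to_sf` are hypothetical stand-ins for `BF8`, `SFtoBF8` and `SBF8toF`, and the factor of 16 used in the usage note is chosen only for illustration.

```c
typedef signed char bf8_t;   /* hypothetical stand-in for BF8 */

/* Scale by bf_factor, then round half away from zero, as SFtoBF8 does.
 * The caller must keep bf_factor * s inside the 8-bit range. */
bf8_t sf_to_bf8(float bf_factor, float s)
{
    return (bf8_t)(bf_factor * s + ((s > 0.0f) ? 0.5f : -0.5f));
}

/* Inverse direction: multiply by the reciprocal factor, as SBF8toF does. */
float bf8_to_sf(float bf_rfactor, bf8_t bfs)
{
    return bf_rfactor * (float)bfs;
}
```

With a factor of 16 (four fractional bits), 0.3 quantizes to 5 and converts back to 0.3125, so the round-trip error stays within half of one quantization step.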

#define CORR_SFtoBF( s ) SFtoBF8( BF_CORR_FACTOR, s )

#define MPATH_VFtoBF( v, bfv, n ) VFtoBF16( BF_MPATH_FACTOR, v, bfv, ((n)<<1) )

#define BHAT_SFtoBF( s ) ((BF8)((s) + (float)BIAS_8BIT))

#define BHAT_SBFtoF( bfs ) ((float)(bfs) - (float)BIAS_8BIT)

#define BHAT_VFtoBF( v, bfv, n ) \
{ \
    float bias = (float)BIAS_8BIT; \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
    fixpixax( v, 1, bfv, n, 0 ); \
}

#define BHAT_VBFtoF( bfv, v, n ) \
{ \
    float bias = (float)(-BIAS_8BIT); \
    fltpixax( bfv, v, 1, n, 0 ); \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
}

#define RHAT_SFtoBF( s ) SFtoBF8( BF_RY_FACTOR, s )
#define RHAT_SBFtoF( bfs ) SBF8toF( BF_RY_RFACTOR, bfs )
#define RHAT_VFtoBF( v, bfv, n ) VFtoBF8( BF_RY_FACTOR, v, bfv, n )
#define RHAT_VBFtoF( bfv, v, n ) VBF8toF( BF_RY_RFACTOR, bfv, v, n )

#define Y_SFtoBF( s ) SFtoBF32( BF_RY_FACTOR, s )
#define Y_SBFtoF( bfs ) SBF32toF( BF_RY_RFACTOR, bfs )
#define Y_VFtoBF( v, bfv, n ) VFtoBF32( BF_RY_FACTOR, v, bfv, n )
#define Y_VBFtoF( bfv, v, n ) VBF32toF( BF_RY_RFACTOR, bfv, v, n )

#define MUDLIB_DECR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    --virt_user; \
    if ( virt_user < 0 ) { \
        --phys_user; \
        virt_user = ptov_map[phys_user] - 1; \
    } \
}

#define MUDLIB_INCR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    ++virt_user; \
    if ( virt_user == ptov_map[phys_user] ) { \
        ++phys_user; \
        virt_user = 0; \
    } \
}
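
The increment macro above walks a (physical user, virtual user) pair through `ptov_map`, which holds the number of virtual users per physical user. The sketch below mirrors that step as a function and uses it to count the virtual users between two pairs. The function names, the sample map, and the half-open counting convention (end pair excluded) are assumptions for illustration; the listing does not show how the library itself counts.

```c
/* Hypothetical sample map: physical users 0, 1, 2 own 2, 1 and 3 virtual users. */
static const unsigned char sample_ptov_map[3] = { 2, 1, 3 };

/* Advance the pair by one virtual user, mirroring MUDLIB_INCR_VIRT_USER.
 * Assumes every physical user has at least one virtual user. */
void incr_virt_user(const unsigned char *ptov_map, int *phys_user, int *virt_user)
{
    ++*virt_user;
    if (*virt_user == ptov_map[*phys_user]) {
        ++*phys_user;
        *virt_user = 0;
    }
}

/* Count virtual users from (start_phys, start_virt) up to, but not
 * including, (end_phys, end_virt). The half-open range is an assumption. */
int count_virt_users(const unsigned char *ptov_map,
                     int start_phys, int start_virt,
                     int end_phys, int end_virt)
{
    int n = 0, p = start_phys, v = start_virt;

    while (p != end_phys || v != end_virt) {
        incr_virt_user(ptov_map, &p, &v);
        ++n;
    }
    return n;
}
```

Walking the sample map from pair (0, 0) to the one-past-the-end pair (3, 0) visits 2 + 1 + 3 = 6 virtual users.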



/****************************************************************************
 * PUBLIC FUNCTION PROTOTYPES
 ****************************************************************************/

int mudlib_get_Corr0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_Corr1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_num_virt_users(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

void mudlib_get_end_user_pair(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int num_virt_users,        /* number from start (must be > 0) */
    int *end_phys_user,        /* zero-based index into ptov_map */
    int *end_virt_user         /* will be < ptov_map[*end_phys_user] */
);

void mudlib_gen_R(
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,    /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,    /* adjusted for starting physical user */
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    float *bf_scalep,          /* scalar: always a power of 2 */
    float *inv_scalep,         /* adjusted for starting physical user */
    float *scalep,             /* start at 0'th physical user */
    char *L1_cachep,           /* must be 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,       /* zero-based ("starting row") */
    int start_virt_user,       /* relative to start_phys_user */
    int end_phys_user,         /* actual number of "rows" to process */
    int end_virt_user          /* relative to end_phys_user */
);

void mudlib_4R_to_3R(
    BF8 *R0_upper_bf,          /* input matrix */
    BF8 *R0_lower_bf,          /* input matrix */
    BF8 *R1_trans_bf,          /* input matrix */
    char *L1_cachep,           /* 32K-byte temp, 32-byte aligned */
    BF8 *R0_bf,                /* output matrix */
    BF8 *R1_bf,                /* output matrix */
    int tot_virt_users
);

void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages );

void mudlib_reformat_corr( COMPLEX *in_corr,
                           COMPLEX_BF8 *corr_0_bf,
                           COMPLEX_BF8 *corr_1_bf,
                           int num_virt_users,
                           int num_multipath );

void fixed_zidotprx( COMPLEX_SPLIT *A, int I, COMPLEX_SPLIT *B, int J,
                     COMPLEX_SPLIT *C, int N, int X );



/*
 * temp names (_v)
 */

int mudlib_get_Corr0_offset_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr1_offset_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_offset_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

#endif /* _MUDLIB_H */
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#include "mudlib.h" 

#define INDEX_5D_TO_LIN(a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))

void mudlib_reformat_corr(
    COMPLEX *in_corr,
    COMPLEX_BF8 *corr_0_bf,
    COMPLEX_BF8 *corr_1_bf,
    int num_virt_users,
    int num_fingers )
{
    int i, j, q, q1;

    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = (i+1); j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_0_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_0_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_0_bf;
                }
            }
        }
    }

    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = 0; j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_1_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_1_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_1_bf;
                }
            }
        }
    }
}
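
The routine above reads the input correlations through a row-major five-dimensional index in which the last subscript varies fastest. The following sketch restates that mapping as a function so individual offsets can be checked by hand. The name `index_5d_to_lin` and the sizes in the usage note (3 virtual users, 4 fingers) are assumptions made for the example.

```c
#include <stddef.h>

/* Row-major linear index for a 5-D array: a4 varies fastest, a0 slowest.
 * The extent of a0 is not needed to compute the offset. */
size_t index_5d_to_lin(size_t a0, size_t a1, size_t a2, size_t a3, size_t a4,
                       size_t max_a1, size_t max_a2, size_t max_a3, size_t max_a4)
{
    return a4 + max_a4 * (a3 + max_a3 * (a2 + max_a2 * (a1 + max_a1 * a0)));
}
```

With 3 virtual users and 4 fingers, one full lag plane holds 3 * 3 * 4 * 4 = 144 entries, so the lag-1 data read by the second loop nest starts at linear index 144.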
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# include "mudlib.h" 



void mtrans32_8bit(
    BF8 *A,            /* logically contiguous input 32 x 32 blocks */
    BF8 *C,            /* output blocks separated by 32 * out_tc elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
);

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int N
);

void mudlib_4R_to_3R(
    BF8 *R0_upper_bf,  /* input matrix */
    BF8 *R0_lower_bf,  /* input matrix */
    BF8 *R1_trans_bf,  /* input matrix */
    char *L1_cachep,   /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_bf,        /* output matrix */
    BF8 *R1_bf,        /* output matrix */
    int tot_virt_users
)
{
    BF8 *R0_work;
    int i, nrows, R0_tcols, tcols;

    /* row length rounded up to a multiple of R_MATRIX_ALIGN */
    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R1_trans_bf, R1_bf, L1_cachep, tot_virt_users,
                       nrows, tcols );
        R1_trans_bf += (tcols << R_MATRIX_ALIGN_LOG);
        R1_bf += R_MATRIX_ALIGN;
    }

    R0_work = R0_bf;
    R0_tcols = tcols;
    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R0_lower_bf, R0_work, L1_cachep, i, nrows, tcols );
        R0_lower_bf += (R0_tcols << R_MATRIX_ALIGN_LOG);
        R0_work += ((tcols << R_MATRIX_ALIGN_LOG) + R_MATRIX_ALIGN);
        R0_tcols -= R_MATRIX_ALIGN;
    }

    mtriangle_8bit( R0_upper_bf, R0_bf, tot_virt_users );
}
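
The routine above pads row lengths with the mask idiom `(n + MASK) & ~MASK`, which rounds a count up to the next multiple of a power of two. A small stand-alone sketch of that idiom follows. The enum names and the helper `round_up_to_align` are hypothetical; the value 5 for the log matches `R_MATRIX_ALIGN_LOG` in the header.

```c
enum {
    ALIGN_LOG  = 5,                 /* same value as R_MATRIX_ALIGN_LOG */
    ALIGN      = 1 << ALIGN_LOG,    /* 32 */
    ALIGN_MASK = ALIGN - 1          /* 31 */
};

/* Round n up to the next multiple of ALIGN (n must be non-negative). */
int round_up_to_align(int n)
{
    return (n + ALIGN_MASK) & ~ALIGN_MASK;
}
```

For example, 33 virtual users need rows of 64 bytes, while exactly 32 users fit in a 32-byte row with no padding.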



#if COMPILE_C

void mtrans32_8bit(
    BF8 *A,            /* logically contiguous input A_nrows x A_ncols blocks */
    BF8 *C,            /* output blocks separated by 32 * C_tcols elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
)
{
    BF8 *Ap, *Cp;
    int A_tcols, C_nrows;
    int i, j;

    (void)L1_cachep;

    A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_nrows = R_MATRIX_ALIGN;

    while ( A_ncols ) {
        if ( A_ncols < C_nrows ) C_nrows = A_ncols;
        Ap = A;
        Cp = C;
        for ( i = 0; i < A_nrows; i++ ) {
            for ( j = 0; j < C_nrows; j++ )
                Cp[j * C_tcols] = Ap[j];
            Ap += A_tcols;
            Cp += 1;
        }
        A += R_MATRIX_ALIGN;                    /* input travels horizontally */
        C += (C_tcols << R_MATRIX_ALIGN_LOG);   /* output travels vertically */
        A_ncols -= C_nrows;
    }
}

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int N
)
{
    int A_counter, A_tcols, altivec_N, C_tcols;
    int i, j;

    A_counter = (N + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_tcols = A_counter + 1;
    altivec_N = (N + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

    for ( i = 0; i < N; i++ ) {
        for ( j = 0; j < altivec_N; j++ )
            C[j] = A[j];
        --altivec_N;
        --A_counter;
        A_tcols = (A_counter + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
        A += (A_tcols + 1);
        C += C_tcols;
    }
}

#endif /* COMPILE_C */
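
Stripped of tiling and padding, the C fallback above performs a strided byte transpose: element (i, j) of the input lands at (j, i) of the output. The sketch below shows only that core step with explicit row strides. `transpose_bytes` and `transpose_demo` are hypothetical helpers written for this illustration; they are not the library's interface.

```c
/* Transpose an nrows x ncols byte matrix. a_stride and c_stride are the
 * row lengths (in bytes) of the input and output buffers. */
void transpose_bytes(const signed char *a, int a_stride,
                     signed char *c, int c_stride,
                     int nrows, int ncols)
{
    int i, j;

    for (i = 0; i < nrows; i++)
        for (j = 0; j < ncols; j++)
            c[j * c_stride + i] = a[i * a_stride + j];
}

/* Fill a 3 x 5 matrix with a[i][j] = 10*i + j (row stride 8), transpose it
 * into a 5 x 3 matrix (row stride 4), and return output element (row, col). */
int transpose_demo(int row, int col)
{
    signed char a[3 * 8] = { 0 };
    signed char c[5 * 4] = { 0 };
    int i, j;

    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++)
            a[i * 8 + j] = (signed char)(10 * i + j);

    transpose_bytes(a, 8, c, 4, 3, 5);
    return c[row * 4 + col];
}
```

Output element (1, 2) equals input element (2, 1), which is 21 in the demo matrix.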
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m 



/*
 * MC Standard Algorithms -- PPC Macro Language Version
 *
 * File Name:    mtrans32_8bit.mac
 *
 * Description:  Perform N_tiles 32 x 32 byte transposes
 *               contiguous input 32 x 32 blocks
 *               output blocks separated by 32 * out_tc elements
 *
 * void mtrans32_8bit (
 *     BF8 *A,
 *     BF8 *C,
 *     char *L1_cache,
 *     int A_ncols,
 *     int A_nrows,
 *     int C_tcols
 * )
 * {
 *     BF8 *Ap, *Cp;
 *     int A_tcols, C_nrows;
 *     int i, j;
 *
 *     A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
 *     C_nrows = R_MATRIX_ALIGN;
 *
 *     while ( A_ncols ) {
 *         if ( A_ncols < C_nrows ) C_nrows = A_ncols;
 *         Ap = A;
 *         Cp = C;
 *         for ( i = 0; i < A_nrows; i++ ) {
 *             for ( j = 0; j < C_nrows; j++ )
 *                 Cp[j * C_tcols] = Ap[j];
 *             Ap += A_tcols;
 *             Cp += 1;
 *         }
 *         A += R_MATRIX_ALIGN;
 *         C += (C_tcols << R_MATRIX_ALIGN_LOG);
 *         A_ncols -= C_nrows;
 *     }
 * }
 *
 * Restrictions: A, C and L1_cache must all be 16-byte aligned.
 *               C_tcols must be a multiple of 16.
 *
 * Mercury Computer Systems, Inc.
 * Copyright (c) 2000 All rights reserved
 *
 * Revision   Date     Engineer   Reason
 * 0.0        000913   fpl        Created
 */


#include "salppc.inc"

#define DO_PREFETCH 1

#define LOAD_INPUT( vT, rA, rB )    LVXL( vT, rA, rB )
#define LOAD_CACHE( vT, rA, rB )    LVX( vT, rA, rB )
#define STORE_CACHE( vS, rA, rB )   STVX( vS, rA, rB )
#define STORE_OUTPUT( vS, rA, rB )  STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG 5
#define R_MATRIX_ALIGN (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG 4
#define ALTIVEC_ALIGN (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK (ALTIVEC_ALIGN - 1)

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM, DST_BUMP ) \
    DSTT( rA, rB, STRM ) \
    ADD( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM, DST_BUMP )
#endif



/**
 * Four permute vectors for output stage
 **/
RODATA_SECTION( 5 )
START_L_ARRAY( local_table )
PERMUTE_MASK( 0x00010405, 0x08090c0d, 0x10111415, 0x18191c1d )
PERMUTE_MASK( 0x02030607, 0x0a0b0e0f, 0x12131617, 0x1a1b1e1f )
PERMUTE_MASK( 0x00020406, 0x080a0c0e, 0x10121416, 0x181a1c1e )
PERMUTE_MASK( 0x01030507, 0x090b0d0f, 0x11131517, 0x191b1d1f )
END_ARRAY
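
Each permute mask above is a list of 16 byte selectors into the 32-byte concatenation of two source vectors, which is how the AltiVec `vperm` instruction interprets its control vector. The sketch below emulates that selection in C for the first mask, assuming the four words of `PERMUTE_MASK` are laid out in big-endian byte order. The function names are hypothetical and the code only models the byte selection, not the listing's output stage.

```c
/* Emulate vperm: out[i] = (a || b)[mask[i] & 0x1f], where a || b is the
 * 32-byte concatenation of the two source vectors. */
void vperm_emulate(const unsigned char a[16], const unsigned char b[16],
                   const unsigned char mask[16], unsigned char out[16])
{
    int i;

    for (i = 0; i < 16; i++) {
        unsigned idx = mask[i] & 0x1fu;
        out[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}

/* Apply the first table entry to a = 0..15, b = 16..31 and return the byte
 * that ends up in the given lane. Byte order of the mask is an assumption. */
unsigned char vperm_demo(int lane)
{
    static const unsigned char mask0[16] = {
        0x00, 0x01, 0x04, 0x05, 0x08, 0x09, 0x0c, 0x0d,
        0x10, 0x11, 0x14, 0x15, 0x18, 0x19, 0x1c, 0x1d
    };
    unsigned char a[16], b[16], out[16];
    int i;

    for (i = 0; i < 16; i++) {
        a[i] = (unsigned char)i;
        b[i] = (unsigned char)(16 + i);
    }
    vperm_emulate(a, b, mask0, out);
    return out[lane];
}
```

Under these assumptions the first mask gathers the even-numbered halfwords of the two inputs; the second gathers the odd halfwords, and the third and fourth gather even and odd bytes.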



/**
 * Input parameters
 **/
#define A r3
#define C r4
#define L1_cache r5
#define NC r6
#define NR r7
#define TCC r8

#define NC_left NC
#define TCA r9
#define TCA4 r10
#define icount r11
#define aptr0 r12
#define aptr1 r13
#define aptr2 r14
#define aptr3 r15

#define aindx0 r16
#define aindx1 r17
#define aindx2 r18
#define aindx3 r19
#define cptr0 r20
#define cptr1 r21
#define cptr2 r22
#define cptr3 r23
#define cindx0 r24
#define cindx1 r25
#define cindx2 r26
#define cindx3 r27

#define cindx4 aindx0
#define cindx5 aindx1
#define cindx6 aindx2
#define cindx7 aindx3
#define out_indx0 aptr0
#define out_indx1 aptr1
#define out_indx2 aptr2
#define out_indx3 aptr3
#define cptr cptr0
#define outptr0 cptr1
#define outptr1 cptr2
#define TCC4 cptr3

#define tptr icount
#define temp aptr3

#define Cbump r0
#define dstp r0
#define dst_code r28



/**
 * G4 registers
 **/
#define a00 v0
#define a01 v1
#define a02 v2
#define a03 v3
#define a10 v4
#define a11 v5
#define a12 v6
#define a13 v7
#define a20 v8
#define a21 v9
#define a22 v10
#define a23 v11
#define a30 v12
#define a31 v13
#define a32 v14
#define a33 v15

#define c00 v16
#define c01 v17
#define c02 v18
#define c03 v19
#define c10 v20
#define c11 v21
#define c12 v22
#define c13 v23
#define c20 c00
#define c21 c01
#define c22 c02
#define c23 c03
#define c30 c10
#define c31 c11
#define c32 c12
#define c33 c13

#define vt0 v24
#define vt1 v25
#define vt2 v26
#define vt3 v27
#define vt4 c00
#define vt5 c01
#define vt6 c02
#define vt7 c03

#define vp0 v28
#define vp1 v29
#define vp2 v30
#define vp3 v31

#define c0 a00
#define c1 a01
#define c2 a02
#define c3 a03
#define c4 a10
#define c5 a11
#define c6 a12
#define c7 a13

#define out0 a20
#define out1 a21
#define out2 a22
#define out3 a23
#define out4 a30
#define out5 a31
#define out6 a32
#define out7 a33

/**
 * Text begins
 **/
FUNC_PROLOG
ENTRY_5( mtrans32_8bit, A, C, L1_cache, N, TCC )
SAVE_r13_r28
USE_THRU_v31( VRSAVE_COND )

ADDI( TCA, NC, R_MATRIX_ALIGN_MASK )
CMPWI( NC_left, 32 )
RLWINM( TCA, TCA, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
LA( tptr, local_table, 0 )
MAKE_STREAM_CODE_IIR( dst_code, 64, 4, TCA )
LVX( vp0, 0, tptr )
ADDI( tptr, tptr, 16 )
LVX( vp1, 0, tptr )
ADDI( tptr, tptr, 16 )
XORI( temp, A, 32 )
LVX( vp2, 0, tptr )
ADDI( tptr, tptr, 16 )
SLWI( TCA4, TCA, 2 )
LVX( vp3, 0, tptr )

BLE( cont )

ANDI_C( temp, temp, 32 )
BR( cont )



/**
 * Outer loop transposes 2 (or 1 at end) 32 x 32 tiles per trip
 **/
LABEL( outer_loop )
/* { */
CMPWI( NC_left, 32 )
LABEL( cont )
ADD( dstp, A, TCA4 ) /* start prefetch advanced */
MR( aptr0, A )
ADD( dstp, dstp, TCA ) /* advanced further */
LI( aindx0, 0 )
ADD( aptr1, aptr0, TCA )
LI( aindx1, 16 )
ADD( aptr2, aptr1, TCA )
MR( cptr0, L1_cache )
ADD( aptr3, aptr2, TCA )
ADDI( cptr1, cptr0, 512 )
LI( cindx0, 0 )
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
LI( cindx1, 128 )
LOAD_INPUT( a10, aptr1, aindx0 )
LI( cindx2, 256 )
LOAD_INPUT( a20, aptr2, aindx0 )
LI( cindx3, 384 )
LOAD_INPUT( a30, aptr3, aindx0 )
MR( icount, NR )

BLE( input_loop_do1 )

LI( aindx2, 32 ) /* these are used only in two tile loop */
LOAD_INPUT( a02, aptr0, aindx2 )
LI( aindx3, 48 )
LOAD_INPUT( a12, aptr1, aindx2 )
ADDI( cptr2, cptr1, 512 )
LOAD_INPUT( a22, aptr2, aindx2 )
ADDI( cptr3, cptr2, 512 )
LOAD_INPUT( a32, aptr3, aindx2 )



/**
 * Top of input loop processes a 4 x 64 byte tile each trip
 **/
LABEL( input_loop_do2 )
PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )
VMRGHW( vt0, a00, a20 ) /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW( vt2, a00, a20 ) /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW( vt1, a10, a30 ) /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW( vt3, a10, a30 ) /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )

VMRGHW( c00, vt0, vt1 ) /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW( c01, vt0, vt1 ) /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW( c02, vt2, vt3 ) /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW( c03, vt2, vt3 ) /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )

VMRGHW( vt0, a01, a21 ) /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a03, aptr0, aindx3 )
VMRGLW( vt2, a01, a21 ) /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a13, aptr1, aindx3 )
VMRGHW( vt1, a11, a31 ) /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a23, aptr2, aindx3 )
VMRGLW( vt3, a11, a31 ) /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a33, aptr3, aindx3 )

VMRGHW( c10, vt0, vt1 ) /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 ) /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 ) /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 ) /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

BLE( flush_input_loop_do2 )

ADD( aindx0, aindx0, TCA4 ) /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )
ADD( aindx2, aindx2, TCA4 )
ADD( aindx3, aindx3, TCA4 )



VMRGHW( vt0, a02, a22 ) /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
VMRGLW( vt2, a02, a22 ) /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
LOAD_INPUT( a02, aptr0, aindx2 )
VMRGHW( vt1, a12, a32 ) /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGLW( vt3, a12, a32 ) /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
LOAD_INPUT( a12, aptr1, aindx2 )

VMRGHW( c20, vt0, vt1 ) /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )
VMRGLW( c21, vt0, vt1 ) /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )
VMRGHW( c22, vt2, vt3 ) /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )
VMRGLW( c23, vt2, vt3 ) /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )

VMRGHW( vt0, a03, a23 ) /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW( vt2, a03, a23 ) /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
LOAD_INPUT( a22, aptr2, aindx2 )
VMRGHW( vt1, a13, a33 ) /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
LOAD_INPUT( a30, aptr3, aindx0 )
VMRGLW( vt3, a13, a33 ) /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
LOAD_INPUT( a32, aptr3, aindx2 )

VMRGHW( c30, vt0, vt1 ) /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )
VMRGLW( c31, vt0, vt1 ) /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )
VMRGHW( c32, vt2, vt3 ) /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )
VMRGLW( c33, vt2, vt3 ) /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

ADDI( cindx0, cindx0, 16 ) /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )
BR( input_loop_do2 )

LABEL( flush_input_loop_do2 )



VMRGHW(vt0, a02, a22)   /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
VMRGLW(vt2, a02, a22)   /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
VMRGHW(vt1, a12, a32)   /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
VMRGLW(vt3, a12, a32)   /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
VMRGHW(c20, vt0, vt1)   /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )
VMRGLW(c21, vt0, vt1)   /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )
VMRGHW(c22, vt2, vt3)   /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )
VMRGLW(c23, vt2, vt3)   /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )

VMRGHW(vt0, a03, a23)   /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
VMRGLW(vt2, a03, a23)   /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
VMRGHW(vt1, a13, a33)   /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
VMRGLW(vt3, a13, a33)   /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
VMRGHW(c30, vt0, vt1)   /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )
VMRGLW(c31, vt0, vt1)   /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )
VMRGHW(c32, vt2, vt3)   /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )
VMRGLW(c33, vt2, vt3)   /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

MR( outptr0, C )        /* set for output loop in current pass */
SLWI( Cbump, TCC, 6 )
ADDI( A, A, 64 )
ADD( C, C, Cbump )      /* bump C for next pass */
LI( icount, 64 )        /* set icount for 2 tiles */
BR( output_start )      /* join to common output loop */
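The merge sequence above transposes a 4 x 4 block of 32-bit words in two stages: rows 0/2 and 1/3 are interleaved first, then the intermediate vectors are interleaved again. The following C sketch models that data flow on plain arrays. The type `vec4w` and the function names are illustrative assumptions, not part of the listing; the sketch follows the word order given in the listing's comments and does not model the loads, cache stores, or software pipelining.

```c
#include <stdint.h>

/* One 128-bit vector register viewed as four 32-bit words (hypothetical type). */
typedef struct { uint32_t w[4]; } vec4w;

/* Models VMRGHW: interleave words 0 and 1 of a and b. */
static vec4w merge_high_words(vec4w a, vec4w b) {
    vec4w r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
    return r;
}

/* Models VMRGLW: interleave words 2 and 3 of a and b. */
static vec4w merge_low_words(vec4w a, vec4w b) {
    vec4w r = {{ a.w[2], b.w[2], a.w[3], b.w[3] }};
    return r;
}

/* Two merge stages give c[j].w[i] == a[i].w[j] for all i, j. */
static void transpose_4x4_words(const vec4w a[4], vec4w c[4]) {
    vec4w vt0 = merge_high_words(a[0], a[2]); /* a0[0] a2[0] a0[1] a2[1] */
    vec4w vt2 = merge_low_words(a[0], a[2]);  /* a0[2] a2[2] a0[3] a2[3] */
    vec4w vt1 = merge_high_words(a[1], a[3]); /* a1[0] a3[0] a1[1] a3[1] */
    vec4w vt3 = merge_low_words(a[1], a[3]);  /* a1[2] a3[2] a1[3] a3[3] */

    c[0] = merge_high_words(vt0, vt1);        /* column 0 of the input */
    c[1] = merge_low_words(vt0, vt1);         /* column 1 */
    c[2] = merge_high_words(vt2, vt3);        /* column 2 */
    c[3] = merge_low_words(vt2, vt3);         /* column 3 */
}
```

Eight merges transpose sixteen words, which is why each input row is consumed by exactly one high merge and one low merge.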

/**
 * Top of input loop processes a 4 x 32 byte tile each trip
 **/

LABEL( input_loop_do1 )
/* { */

PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )

VMRGHW(vt0, a00, a20)   /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW(vt2, a00, a20)   /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW(vt1, a10, a30)   /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW(vt3, a10, a30)   /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )

VMRGHW(c00, vt0, vt1)   /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW(c01, vt0, vt1)   /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW(c02, vt2, vt3)   /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW(c03, vt2, vt3)   /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )

BLE( flush_input_loop_do1 )

ADD( aindx0, aindx0, TCA4 )     /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )

VMRGHW(vt0, a01, a21)   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 )    /*** begins next sequence ***/
VMRGLW(vt2, a01, a21)   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGHW(vt1, a11, a31)   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW(vt3, a11, a31)   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a30, aptr3, aindx0 )

VMRGHW(c10, vt0, vt1)   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW(c11, vt0, vt1)   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW(c12, vt2, vt3)   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW(c13, vt2, vt3)   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )



ADDI( cindx0, cindx0, 16 )      /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )
BR( input_loop_do1 )

LABEL( flush_input_loop_do1 )

VMRGHW(vt0, a01, a21)   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
VMRGLW(vt2, a01, a21)   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
VMRGHW(vt1, a11, a31)   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
VMRGLW(vt3, a11, a31)   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
VMRGHW(c10, vt0, vt1)   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW(c11, vt0, vt1)   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW(c12, vt2, vt3)   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW(c13, vt2, vt3)   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

MR( outptr0, C )        /* set for output loop in current pass */
SLWI( Cbump, TCC, 5 )
ADDI( A, A, 32 )
ADD( C, C, Cbump )      /* bump C for next pass */
LI( icount, 32 )        /* set icount for 1 tile */



/**
 * Second stage of transposition, write output
 **/

LABEL( output_start )

CMPW_CR( 6, icount, NC_left )
MR( cptr, L1_cache )
SLWI( TCC4, TCC, 2 )
LI( cindx0, 0 )
LI( cindx1, 16 )
LI( cindx2, 2*16 )
LI( cindx3, 3*16 )
LI( cindx4, 4*16 )
LI( cindx5, 5*16 )
LI( cindx6, 6*16 )

BLE_CR( 6, PC_OFFSET( 8 ) )
MR( icount, NC_left )
LI( cindx7, 7*16 )
SUB( NC_left, NC_left, icount )

ADDIC_C( icount, icount, -4 )
LI( out_indx0, 0 )
LOAD_CACHE( c0, cptr, cindx0 )
ADD( out_indx1, out_indx0, TCC )
LOAD_CACHE( c1, cptr, cindx1 )
ADD( out_indx2, out_indx1, TCC )
LOAD_CACHE( c2, cptr, cindx2 )
ADD( out_indx3, out_indx2, TCC )
LOAD_CACHE( c3, cptr, cindx3 )
ADDI( outptr1, outptr0, 16 )
LOAD_CACHE( c4, cptr, cindx4 )
VPERM( vt0, c0, c1, vp0 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( vt1, c0, c1, vp1 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( vt2, c2, c3, vp0 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( vt3, c2, c3, vp1 )
ADDI( cptr, cptr, 128 )
BR( output_mloop )

/**
 * Loop outputs four 32 byte rows
 **/

LABEL( output_loop )

ADDIC_C( icount, icount, -4 )
ADDI( cptr, cptr, 128 )

STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
VPERM( out7, vt5, vt7, vp3 )

STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( vt0, c0, c1, vp0 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
VPERM( vt1, c0, c1, vp1 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( vt2, c2, c3, vp0 )
STORE_OUTPUT( out7, outptr1, out_indx3 )
VPERM( vt3, c2, c3, vp1 )

ADD( outptr0, outptr0, TCC4 )
ADD( outptr1, outptr1, TCC4 )

LABEL( output_mloop )

BLE( flush_output_loop )

LOAD_CACHE( c0, cptr, cindx0 )
VPERM( vt4, c4, c5, vp0 )
LOAD_CACHE( c1, cptr, cindx1 )
VPERM( vt5, c4, c5, vp1 )
LOAD_CACHE( c2, cptr, cindx2 )
VPERM( vt6, c6, c7, vp0 )
LOAD_CACHE( c3, cptr, cindx3 )
VPERM( vt7, c6, c7, vp1 )

LOAD_CACHE( c4, cptr, cindx4 )
VPERM( out0, vt0, vt2, vp2 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( out1, vt0, vt2, vp3 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( out2, vt1, vt3, vp2 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( out3, vt1, vt3, vp3 )

BR( output_loop )

LABEL( flush_output_loop )

VPERM( vt4, c4, c5, vp0 )
VPERM( vt5, c4, c5, vp1 )
VPERM( vt6, c6, c7, vp0 )
VPERM( vt7, c6, c7, vp1 )

CMPWI( icount, -3 )
VPERM( out0, vt0, vt2, vp2 )
STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
BEQ( oloop_next )

CMPWI( icount, -2 )
VPERM( out1, vt0, vt2, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
BEQ( oloop_next )

CMPWI( icount, -1 )
VPERM( out2, vt1, vt3, vp2 )
STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
BEQ( oloop_next )

VPERM( out3, vt1, vt3, vp3 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( out7, vt5, vt7, vp3 )
STORE_OUTPUT( out7, outptr1, out_indx3 )

/**
 * Next four rows of C?
 **/

LABEL( oloop_next )

BLT_CR( 6, outer_loop )     /* branch if icount < NC_left */

/**
 * Exit routine
 **/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )
REST_r13_r28
RETURN

FUNC_EPILOG
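The flush path above relies on the value left in icount after the last subtraction of 4: a result of -3, -2, -1 or 0 corresponds to 1, 2, 3 or 4 rows still to be stored. The C sketch below isolates only that counting logic, under the assumption that at least one row remains on entry. The function name is hypothetical and the vector work is omitted.

```c
/*
 * Count-down-by-four loop with a flush trip for the remainder.
 * Returns how many rows the flush trip must store (1..4).
 * Assumes total_rows >= 1.
 */
static int rows_in_final_trip(int total_rows) {
    int icount = total_rows;
    for (;;) {
        icount -= 4;        /* models ADDIC_C( icount, icount, -4 ) */
        if (icount <= 0) {  /* models BLE( flush_output_loop ) */
            break;
        }
        /* a full trip of four rows would be stored here */
    }
    /* icount is now in -3..0: -3 means 1 row, -2 means 2, -1 means 3, 0 means 4 */
    return icount + 4;
}
```

Comparing against -3, -2 and -1 in turn, as the listing does, lets the flush code stop after the last valid row without a separate remainder computation.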
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MC Standard Algorithms -- PPC Macro language Version 



File Name:    mtriangle_8bit.mac

Description:  Move from an upper triangular matrix stored
              as a series of 32-line rectangles, each of
              width 32 elements less than its immediate
              predecessor, to the upper triangle of a
              full N x N matrix.

              mtriangle_8bit ( char *A, char *C, int N )

Restrictions: A and C must both be 16-byte aligned.
              N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason

0.0         000605    jg          Created



#include "salppc.inc"

#define LOAD_A( vT, rA, rB )    LVXL( vT, rA, rB )
#define LOAD_C( vT, rA, rB )    LVX( vT, rA, rB )
#define STORE_C( vS, rA, rB )   STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG      5
#define R_MATRIX_ALIGN          (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK     (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG       4
#define ALTIVEC_ALIGN           (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK      (ALTIVEC_ALIGN - 1)

/**
 * Input parameters
 **/

#define A           r3
#define C           r4
#define N           r5
#define A_tcols     r6
#define C_tcols     r7
#define altivec_N   r8
#define A_counter   r9
#define index0      r10
#define index1      r11
#define index2      r12
#define index3      r13
#define count       r0

#define a0          v0
#define a1          v1
#define a2          v2
#define a3          v3
#define c0          v4
#define shift       v5
#define shift_incr  v6
#define mask        v7
#define left        v8
#define right       v9

FUNC_PROLOG

ENTRY_3( mtriangle_8bit, A, C, N )
SAVE_r13
USE_THRU_v9( VRSAVE_COND )

ADDI( A_counter, N, R_MATRIX_ALIGN_MASK )
VSPLTISW( shift_incr, 8 )
ADDI( altivec_N, N, ALTIVEC_ALIGN_MASK )
VXOR( shift, shift, shift )
RLWINM( A_counter, A_counter, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
RLWINM( altivec_N, altivec_N, 0, 0, (31 - ALTIVEC_ALIGN_LOG) )
ADDI( C_tcols, A_counter, 1 )

LABEL( oloop )

ADDIC_C( count, altivec_N, -64 )
LOAD_C( c0, 0, C )
VSPLTISW( mask, -1 )
LOAD_A( a0, 0, A )
VSRO( mask, mask, shift )
LI( index0, 16 )
VANDC( left, c0, mask )
LI( index1, 32 )
VAND( right, a0, mask )
LI( index2, 48 )
VOR( c0, left, right )
STORE_C( c0, 0, C )
BLE( dosmall )
LI( index3, 64 )

LABEL( iloop )

LOAD_A( a0, A, index0 )
ADDIC_C( count, count, -64 )
LOAD_A( a1, A, index1 )
LOAD_A( a2, A, index2 )
LOAD_A( a3, A, index3 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 64 )
STORE_C( a1, C, index1 )
ADDI( index1, index1, 64 )
STORE_C( a2, C, index2 )
ADDI( index2, index2, 64 )
STORE_C( a3, C, index3 )
ADDI( index3, index3, 64 )
BGT( iloop )

LABEL( dosmall )

ADDIC_C( count, count, 48 )
BLE( windout )

LABEL( sloop )

ADDIC_C( count, count, -16 )
LOAD_A( a0, A, index0 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 16 )
BGT( sloop )

LABEL( windout )

DECR_C( N )
VADDUWM( shift, shift, shift_incr )
ADDI( A_counter, A_counter, -1 )
ADDI( A, A, 1 )
ADDI( A_tcols, A_counter, R_MATRIX_ALIGN_MASK )
DECR( altivec_N )
RLWINM( A_tcols, A_tcols, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
ADD( C, C, C_tcols )
ADD( A, A, A_tcols )
BNE( oloop )

FREE_THRU_v9( VRSAVE_COND )
REST_r13
RETURN

FUNC_EPILOG
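Two idioms in the routine above can be shown in scalar C. The ADDI followed by RLWINM rounds a count up to a power-of-two multiple by adding the alignment mask and clearing the low bits. The VSRO / VANDC / VAND / VOR sequence blends the first 16-byte block of a row, keeping the leading bytes already in C and taking the rest from A. This is a minimal sketch: the function names are hypothetical, the byte order follows the big-endian view used in the listing, and `row` stands in for the running `shift` value divided by 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to a multiple of 2^log2_align: add the mask, clear the low bits. */
static uint32_t round_up_pow2(uint32_t n, unsigned log2_align) {
    uint32_t mask = (1u << log2_align) - 1u;
    return (n + mask) & ~mask;
}

/*
 * Blend the first 16-byte block of a row. Bytes before position (row mod 16)
 * keep the value already in c; the remaining bytes are copied from a.
 */
static void blend_first_block(uint8_t c[16], const uint8_t a[16], unsigned row) {
    unsigned skip = row & 15u;   /* only four bits of the octet shift count are used */
    for (size_t i = 0; i < 16; ++i) {
        uint8_t mask = (i >= skip) ? 0xFF : 0x00;          /* all-ones shifted right */
        c[i] = (uint8_t)((c[i] & (uint8_t)~mask) | (a[i] & mask));
    }
}
```

For example, `round_up_pow2(33, 5)` yields 64, matching the 32-element rectangle width named in the file header.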



#if !defined( SALPPC_H )
#define SALPPC_H

#if 0
***********************************************************************
* MC Standard Algorithms -- PPC Version
***********************************************************************

File Name:    salppc.h
Description:  SAL macro include file

Source files should have extension .mac (for example, vadd.mac)
and must include this file (salppc.h).

To assemble for PPC ucode, use the following basic
makefile build rule:

.SUFFIXES: .mac .c .s .o

.mac.o:
	cp $*.mac $*.c
	ccmc -o $*.s -E $*.c
	ccmc -c -o $*.o $*.s
	rm -f $*.s
	rm -f $*.c

To compile for C, use the following basic makefile build rule:

.SUFFIXES: .mac .c .o

.mac.o:
	cp $*.mac $*.c
	ccmc -DCOMPILE_C -c -o $*.o $*.c
	rm -f $*.c

The first 8 function arguments are passed in GPR registers
r3 - r10. Arguments beyond 8 are passed on the stack and may
be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
Additional GPR registers should be assigned in ascending order
starting from the last function argument. These may be declared
with the DECLARE_rx[_ry] macros. For example, a function with
5 arguments that requires 3 additional GPR registers would
issue: DECLARE_r8_r10. r0, if required, should be declared
separately with the DECLARE_r0 macro. GPR registers above r12
must be saved and restored using the SAVE_r13[_ry] and
REST_r13[_ry] macros, respectively.

FPR registers should be assigned in ascending order starting
with f0[d0]. These may be declared with the DECLARE_f0[_fy]
or DECLARE_d0[_dy] macros.

For example, DECLARE_f0_f11. FPR registers above f13[d13] must
be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.

All variables must be assigned a register using the
pre-processor #define directive. GPR registers are named
r0 - r31; single precision FPR registers are named f0 - f31.
Double precision FPR registers are named d0 - d31. Different
variables may be assigned to the same register as in:

#define vara    f12
#define varb    f12



Functions must begin with the FUNC_PROLOG macro and end
with the FUNC_EPILOG macro.

Macros are provided for both Fortran and C entry points.

The GET_SALCACHE macro should be used to get the address of
the "current" salcache buffer into a GPR register.

Avoid terminating macro lines with a semicolon.

The following example demonstrates typical usage:

#include "salppc.h"

/*
 * assign variables to registers
 */
#define A       r3
#define I       r4
#define B       r5
#define J       r6
#define C       r7
#define K       r8
#define D       r9
#define L       r10
#define N       r12
#define EFLAG   r11
#define count   r11
#define t0      r13
#define t1      r13
#define t2      r14
#define t3      r14
#define t4      r15
#define t5      r15
#define t6      r16

#define a0      f0
#define a1      f1
#define a2      f2
#define a3      f3
#define b0      f4
#define b1      f5
#define b2      f6
#define b3      f7
#define c0      f8
#define c1      f9
#define c2      f10
#define c3      f11
#define d0      f12
#define d1      f13
#define d2      f14
#define d3      f15

FUNC_PROLOG                     /* must precede function */

#if !defined( COMPILE_C )
U_ENTRY(foo_)
FORTRAN_DREF_4(I, J, K, L)
FORTRAN_DREF_ARG8

U_ENTRY(foo)
LI(EFLAG, 0)
BR(common)

U_ENTRY(foo_x_)
FORTRAN_DREF_4(I, J, K, L)
FORTRAN_DREF_ARG8
FORTRAN_DREF_ARG9
#endif

ENTRY_10(foo_x, A, I, B, J, C, K, D, L, N, EFLAG)
DECLARE_r13_r16
DECLARE_f0_f15
GET_ARG9( EFLAG )               /* get the 9'th arg (EFLAG) off stack */

LABEL(common)
SAVE_CR                         /* needed if using fields 2,3 or 4 */
SAVE_r13_r16
SAVE_f14_f15
SAVE_LR                         /* needed if making a function call */
GET_ARG8( N )                   /* get the 8'th arg (N) off stack */

body of function

REST_CR
REST_r13_r16
REST_f14_f15
REST_LR
RETURN

FUNC_EPILOG                     /* must conclude function */



Mercury Computer Systems, Inc. 
Copyright (c) 1996 All rights reserved 



* Revision 



Date 



Engineer; Reason 



0.0 
1 



0 



0.2 



0.3 

0.4 



960223 
970109 



970124 

970521 
980813 



jfk; 



jfk; 



jfk; 



Created * 

Added POSTING BUFFER COUNT and made * 

TEST IF DCBZ macro time "stw" instead * 

of doing the TEST IF DCBT macro (Iwz) * 

Added SALCACHE ALLOC SIZE , * 

ALIGN SALCACHE, CREATE^SALCACHE^FRAME * 

DESTROY SALCACHE FRAME * 

Added SET DCB [TZ] COND macros . * 

Made old macros not assemble * 

^ ^«v.«^^ jfk; Changes SALCACHE ALLOC SIZE for 750 * 
.. ***************************************************** *******^ 
#endif /* header */ 



#include <math.h> 



#define uchar unsigned char 
#define ulong unsigned long 
#define ushort unsigned short 

#define CR __cr 
#define CTR _ctr 
#define VSCR _vscr

/*
 * define a structure to represent a VMX register
 */

typedef union {
    char c[16];
    uchar uc[16];
    short s[8];
    ushort us[8];
    long l[4];
    ulong ul[4];
    float f[4];
} VMX_reg;
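The union above lets C code address one vector register as bytes, halfwords, words or floats. The sketch below shows how such a union can emulate two of the vector instructions used elsewhere in these listings. It is an illustration under stated assumptions, not the header's own emulation code: the type `vmx_reg_sketch` and both function names are hypothetical, and fixed-width `int32_t`/`uint32_t` replace `long`/`ulong` so that the union stays 16 bytes on platforms where `long` is 64 bits wide.

```c
#include <stdint.h>

/* Hypothetical stand-in for VMX_reg with fixed-width members. */
typedef union {
    int8_t   c[16];
    uint8_t  uc[16];
    int16_t  s[8];
    uint16_t us[8];
    int32_t  l[4];
    uint32_t ul[4];
    float    f[4];
} vmx_reg_sketch;

/* Models VSPLTISW: copy a small signed immediate into all four words. */
static void vspltisw_sketch(vmx_reg_sketch *vT, int simm) {
    for (int i = 0; i < 4; ++i) {
        vT->l[i] = (int32_t)simm;
    }
}

/* Models VANDC: vT = vA AND (NOT vB), word by word. */
static void vandc_sketch(vmx_reg_sketch *vT,
                         const vmx_reg_sketch *vA,
                         const vmx_reg_sketch *vB) {
    for (int i = 0; i < 4; ++i) {
        vT->ul[i] = vA->ul[i] & ~vB->ul[i];
    }
}
```

Splatting -1 through the word view and then reading the byte view gives sixteen 0xFF bytes, which is how an all-ones mask is built before it is shifted.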

#define FUNC_PROLOG

#define FUNC_EPILOG \
}

#define TEXT_SECTION( logb2_align )
#define DATA_SECTION( logb2_align )
#define RODATA_SECTION( logb2_align )
/*
 * macro for C extern declarations
 */

#define EXTERN_DATA( symbol ) \
extern long symbol;

#define EXTERN_FUNC( func ) \
extern void func ( void );

/*
 * macro for a global declaration
 */

#define GLOBAL( symbol )

/*
 * macro for a local declaration
 */

#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */

#define START_ARRAY( type, name ) \
type name[] = {

#define START_C_ARRAY( name )   START_ARRAY( char, name )
#define START_UC_ARRAY( name )  START_ARRAY( uchar, name )
#define START_S_ARRAY( name )   START_ARRAY( short, name )
#define START_US_ARRAY( name )  START_ARRAY( ushort, name )
#define START_L_ARRAY( name )   START_ARRAY( long, name )
#define START_UL_ARRAY( name )  START_ARRAY( ulong, name )
#define START_F_ARRAY( name )   START_ARRAY( float, name )

#define END_ARRAY \
};

#define DATA( d1 ) \
d1,

#define DATA2( d1, d2 ) \
d1, d2,

#define DATA4( d1, d2, d3, d4 ) \
d1, d2, d3, d4,

#define DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
d1, d2, d3, d4, d5, d6, d7, d8,

#define C_DATA( d1 )    DATA( d1 )
#define UC_DATA( d1 )   DATA( d1 )
#define S_DATA( d1 )    DATA( d1 )
#define US_DATA( d1 )   DATA( d1 )
#define L_DATA( d1 )    DATA( d1 )
#define UL_DATA( d1 )   DATA( d1 )
#define F_DATA( d1 )    DATA( d1 )
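In the C build, the array macros above expand to an ordinary initialized array: START_ARRAY opens the initializer, each DATA macro contributes comma-terminated values, and END_ARRAY closes it. The sketch below repeats a few of the definitions as they read in this listing (the `name[]` form of START_ARRAY is an assumption recovered from damaged text) and builds a small table. The names `demo_table` and `demo_table_len` are hypothetical.

```c
/* Definitions repeated from the listing as read; START_ARRAY's body is an assumption. */
#define START_ARRAY( type, name ) \
type name[] = {

#define END_ARRAY \
};

#define DATA4( d1, d2, d3, d4 ) \
d1, d2, d3, d4,

#define START_L_ARRAY( name )       START_ARRAY( long, name )
#define L_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )

/* Expands to: long demo_table[] = { 1, 2, 3, 4, 5, 6, 7, 8, }; */
START_L_ARRAY( demo_table )
L_DATA4( 1, 2, 3, 4 )
L_DATA4( 5, 6, 7, 8 )
END_ARRAY

/* Number of elements in the table built above. */
static unsigned demo_table_len(void) {
    return (unsigned)(sizeof(demo_table) / sizeof(demo_table[0]));
}
```

The trailing comma left by the last DATA macro is legal in a C initializer list, which is why no separate "last element" macro is needed.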

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 )    DATA2( d2, d1 )
#else
#define D_DATA( d1, d2 )    DATA2( d1, d2 )
#endif

#define C_DATA2( d1, d2 )   DATA2( d1, d2 )
#define UC_DATA2( d1, d2 )  DATA2( d1, d2 )
#define S_DATA2( d1, d2 )   DATA2( d1, d2 )
#define US_DATA2( d1, d2 )  DATA2( d1, d2 )
#define L_DATA2( d1, d2 )   DATA2( d1, d2 )
#define UL_DATA2( d1, d2 )  DATA2( d1, d2 )
#define F_DATA2( d1, d2 )   DATA2( d1, d2 )

#define C_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )



/*
 * macros for creating vmx permute masks (128-bits)
 */

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l )    ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s )    ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c )    ( (c) ^ 0x1f )

#define L_INDEX_MUNGE( x )      ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x )      ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x )      ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l )    ( l )
#define S_PERMUTE_MUNGE( s )    ( s )
#define C_PERMUTE_MUNGE( c )    ( c )

#define L_INDEX_MUNGE( x )      ( x )
#define S_INDEX_MUNGE( x )      ( x )
#define C_INDEX_MUNGE( x )      ( x )

#endif

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 ),

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 ),

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 ),
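The little-endian MUNGE macros above adjust permute selectors with a single XOR. A byte selector indexes one of 32 source bytes, so XOR with 0x1f maps index c to 31 - c. The halfword and word forms flip only the bits that select the element and leave the byte offset inside the element unchanged. The sketch below restates the three constants as C functions so the arithmetic can be checked; the function names are hypothetical, and reading the XOR as a mirror of element order for little-endian builds is an interpretation of the macros, not something the listing states.

```c
#include <stdint.h>

/* Byte selector: 0..31 becomes 31..0. */
static uint8_t c_permute_munge(uint8_t c) {
    return (uint8_t)(c ^ 0x1f);
}

/* Two byte selectors packed in a halfword: bit 0 of each selector is kept. */
static uint16_t s_permute_munge(uint16_t s) {
    return (uint16_t)(s ^ 0x1e1e);
}

/* Four byte selectors packed in a word: bits 0 and 1 of each selector are kept. */
static uint32_t l_permute_munge(uint32_t l) {
    return l ^ 0x1c1c1c1cu;
}
```

For instance, the word selector 0x00010203 (bytes 0 to 3, the first word) becomes 0x1c1d1e1f (bytes 28 to 31, the last word), with the bytes inside the word still in ascending order.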

/*
 * macro for a microcode entry point (e.g. vaddx, vaddx_)
 * U_ENTRY is a "nop" for C code
 */

#define U_ENTRY( func_name )

/*
 * macros for C function prototypes
 */

#define C_PROTOTYPE_0( func_name ) \
void func_name ( void );

#define C_PROTOTYPE_1( func_name ) \
void func_name ( long );

#define C_PROTOTYPE_2( func_name ) \
void func_name ( long, long );

#define C_PROTOTYPE_3( func_name ) \
void func_name ( long, long, long );

#define C_PROTOTYPE_4( func_name ) \
void func_name ( long, long, long, long );

#define C_PROTOTYPE_5( func_name ) \
void func_name ( long, long, long, long, long );

#define C_PROTOTYPE_6( func_name ) \
void func_name ( long, long, long, long, long, long );

#define C_PROTOTYPE_7( func_name ) \
void func_name ( long, long, long, long, long, long, long );

#define C_PROTOTYPE_8( func_name ) \
void func_name ( long, long, long, long, long, long, long, long );

#define C_PROTOTYPE_9( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long );

#define C_PROTOTYPE_10( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long );

#define C_PROTOTYPE_11( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long );

#define C_PROTOTYPE_12( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long, long );

#define C_PROTOTYPE_13( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long, long, long );

#define C_PROTOTYPE_14( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long, long, long, long );

#define C_PROTOTYPE_15( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long, long, long, long, long );

#define C_PROTOTYPE_16( func_name ) \
void func_name ( long, long, long, long, long, long, long, long, \
                 long, long, long, long, long, long, long, long );

#define AUTO_r3_r31 \
long r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r4_r31 \
long r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r5_r31 \
long r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r6_r31 \
long r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r7_r31 \
long r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r8_r31 \
long r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r9_r31 \
long r9, r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r10_r31 \
long r10, r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r11_r31 \
long r11, r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r12_r31 \
long r12, r13, r14, r15, r16, r17, \
     r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
     r31;

#define AUTO_r13_r31 \
long r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r14_r31 \
long r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r15_r31 \
long r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r16_r31 \
long r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r17_r31 \
long r17, r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r18_r31 \
long r18, r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_r19_r31 \
long r19, r20, r21, r22, r23, r24, r25, \
     r26, r27, r28, r29, r30, r31;

#define AUTO_f0_f31 \
float f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, \
      f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, \
      f28, f29, f30, f31;

#define AUTO_d0_d31 \
double d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, \
       d15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, \
       d28, d29, d30, d31;

#if defined( BUILD_MAX )

#define AUTO_v0_v31 \
VMX_reg v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, \
        v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, \
        v28, v29, v30, v31;

#endif
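The AUTO macros above are what let a .mac source compile as C: every register name becomes an ordinary local variable, so a #define that maps a variable to a register simply renames a local. The sketch below shows the mechanism with a deliberately reduced set of macros. Everything prefixed `SKETCH_`, the type `reg_t`, and the function `add_sixteen` are hypothetical and much simpler than the header's own ENTRY macros; `intptr_t` is used instead of `long` so that an address fits in a register variable on any platform.

```c
#include <stdint.h>

typedef intptr_t reg_t;   /* one general purpose register (sketch only) */

/* Open a two-argument function; the arguments land in r3 and r4 by naming. */
#define SKETCH_ENTRY_2( func_name, arg0, arg1 ) \
void func_name ( reg_t arg0, reg_t arg1 ) \
{ \
    reg_t r5 = 0, r6 = 0; \
    (void)r5; (void)r6;

/* Close the function opened by SKETCH_ENTRY_2. */
#define SKETCH_EPILOG }

/* "Instructions" become C statements on the register-named locals. */
#define SKETCH_ADDI( rT, rA, imm )  rT = (rA) + (imm);
#define SKETCH_STORE( rS, rA )      *(reg_t *)(rA) = (rS);

/* assign variables to registers, as a .mac source would */
#define OUT r3
#define N   r4
#define tmp r5

SKETCH_ENTRY_2( add_sixteen, OUT, N )
    SKETCH_ADDI( tmp, N, 16 )
    SKETCH_STORE( tmp, OUT )
SKETCH_EPILOG

#undef OUT
#undef N
#undef tmp
```

Because macro arguments are expanded before substitution, `OUT` and `N` reach the function signature as `r3` and `r4`, which mirrors how ENTRY_n names its parameters and why the AUTO list for an n-argument function starts at the next free register.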



/*
 * For C implementation, create a dummy stack on function entry of size
 * 4096.
 */

#define STACK_SIZE 4096

/*
 * macros for C and Fortran callable entry points
 */

#define ENTRY_0( func_name ) \
C_PROTOTYPE_0( func_name ) \
void func_name ( void ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r3_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_1( func_name, arg0 ) \
C_PROTOTYPE_1( func_name ) \
void func_name ( long arg0 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r4_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_2( func_name, arg0, arg1 ) \
C_PROTOTYPE_2( func_name ) \
void func_name ( long arg0, long arg1 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r5_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
C_PROTOTYPE_3( func_name ) \
void func_name ( long arg0, long arg1, long arg2 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r6_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
C_PROTOTYPE_4( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r7_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
C_PROTOTYPE_5( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r8_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
C_PROTOTYPE_6( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
                 long arg5 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r9_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_7 ( func_name, argO, argl, arg2, arg3, arg4, argS, \ 

arg6 ) \ 
C PROTOTYPE 7 ( f unc name ) \ 

void func__name ( long argO, long argl, long arg2, long arg3, long arg4, \ 
long arg5, long arg6 ) \ 

long CR[8] ; ulong CTR; ulong VSCR; long rO; \ 
AUTO no r31 \ 
AUTO fO f31 \ 
AUTO do d31 \ 
AUTO__vO v31 \ 

long gpr save area [ 19 + 4 ] ; \ 
long fpr save area [ 2*18 + 4 ]; \ 
long vr save area[ 4*12 + 4 ]; \ 
long stack [STACK_SIZE + 4], sp; 

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
C_PROTOTYPE_8( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r11_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
C_PROTOTYPE_9( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r12_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
C_PROTOTYPE_10( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r13_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
C_PROTOTYPE_11( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r14_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
C_PROTOTYPE_12( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r15_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
C_PROTOTYPE_13( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r16_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
C_PROTOTYPE_14( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r17_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
C_PROTOTYPE_15( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r18_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
C_PROTOTYPE_16( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14, long arg15 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r19_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;






/*
 * macros to get GPR arguments beyond 8
 */
#define GET_ARG8( rD )
#define GET_ARG9( rD )
#define GET_ARG10( rD )
#define GET_ARG11( rD )
#define GET_ARG12( rD )
#define GET_ARG13( rD )
#define GET_ARG14( rD )
#define GET_ARG15( rD )
#define GET_ARG16( rD )
#define GET_ARG17( rD )

/*
 * macros to set GPR arguments beyond 8
 */
#define SET_ARG8( rD )
#define SET_ARG9( rD )
#define SET_ARG10( rD )
#define SET_ARG11( rD )
#define SET_ARG12( rD )
#define SET_ARG13( rD )
#define SET_ARG14( rD )
#define SET_ARG15( rD )
#define SET_ARG16( rD )
#define SET_ARG17( rD )

/*
 * macro to branch from one entry point to another
 */
#define BR_FUNC( func_name ) \
    func_name();

/*
 * macros to call functions
 */
#define CALL_FUNC( func_name ) \
    func_name();

/*
 * macros to call functions
 */
#define CALL_0( func_name ) \
    func_name();

#define CALL_1( func_name, arg0 ) \
    func_name( arg0 );

#define CALL_2( func_name, arg0, arg1 ) \
    func_name( arg0, arg1 );

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    func_name( arg0, arg1, arg2 );

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    func_name( arg0, arg1, arg2, arg3 );

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    func_name( arg0, arg1, arg2, arg3, arg4 );

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5 );

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6 );

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8 );

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9 );

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10 );

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11 );

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12 );

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13 );

#if defined( BUILD_MAX )
/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */
#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )
#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )



/*
 * G4 macros to decide whether to enter a VMX loop.
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note: a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (not supported in C)
 */
#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag )
#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag )
#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag )
#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag )
#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag )
#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, \
                              eflag )

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag )
#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag )
#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag )
#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag )
#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag )
#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag )
#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag )

/*
 * G4 macro to get VMX unaligned (FP) count
 * assumes all vectors have the same relative alignment
 * and that the last 2 bits of ptr are 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 2) & 3; \
    CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 1) & 7; \
    CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = (count) & 15; \
    CR[0] = (long)(count); \
}
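
The three unaligned-count macros above negate the address and keep its low bits, which gives the number of elements (4-byte, 2-byte, or 1-byte) lying before the next 16-byte boundary. A minimal sketch of the same arithmetic as plain C functions follows; the function names and the use of `uintptr_t` are assumptions made for illustration and do not appear in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function forms of the GET_VMX_UNALIGNED_COUNT* macros. */

/* Number of 4-byte elements before the next 16-byte boundary.
 * Like the macro, this assumes the last 2 bits of ptr are 0. */
static long unaligned_count_fp(uintptr_t ptr)
{
    assert((ptr & 3) == 0);
    return (long)(((0 - ptr) >> 2) & 3);
}

/* Number of 2-byte elements before the next 16-byte boundary.
 * Like the macro, this assumes the last bit of ptr is 0. */
static long unaligned_count_short(uintptr_t ptr)
{
    assert((ptr & 1) == 0);
    return (long)(((0 - ptr) >> 1) & 7);
}

/* Number of bytes before the next 16-byte boundary. */
static long unaligned_count_char(uintptr_t ptr)
{
    return (long)((0 - ptr) & 15);
}
```

For an address ending in 0x4, for example, 12 bytes remain before the boundary, so the FP count is 3; an address that is already 16-byte aligned yields 0 in all three forms.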

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */
#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    (vt).f[0] = (vt).f[1] = (vt).f[2] = (vt).f[3] = *scalarp;

#endif /* end BUILD_MAX */

/*
 * cache (DCBT and DCBZ) macros.
 */
#define DCBT_TRUE( cond_bit, scratch ) \
    CR[(cond_bit)] = -1;    /* true (<= 0) */

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    CR[(cond_bit)] = 1;     /* false (> 0) */

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define DCBT_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBT( rA, rB ) }

#define DCBZ_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBZ( rA, rB ) }

#define DCBT_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBT_IF( cond_bit, rA, rB )

#define DCBZ_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBZ_IF( cond_bit, rA, rB )

#define BR_IF_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] <= 0 ) \
        goto label;

#define BR_IF_NOT_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] > 0 ) \
        goto label;

/*
 * ASIC macros
 */

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    *(volatile long *)PREFETCH_CONTROL = (mode);

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    *(volatile long *)MISCON_B = (mode);

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
{ \
    volatile long i; \
    i = *(volatile long *)MISCON_B; \
    i &= PREFETCH_MASK; \
    i |= USE_PREFETCH_CONTROL; \
    *(volatile long *)PREFETCH_CONTROL = i; \
}

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif



/*
 *
 */
#define ADD( rD, rA, rB )               (rD) = (rA) + (rB);
#define ADD_C( rD, rA, rB )             (rD) = (rA) + (rB); CR[0] = (long)(rD);
#define ADDI( rD, rA, SIMM )            (rD) = (rA) + (SIMM);
#define ADDIC_C( rD, rA, SIMM )         (rD) = (rA) + (SIMM); CR[0] = (long)(rD);
#define ADDIS( rD, rA, SIMM )           (rD) = (rA) + ((SIMM) << 16);
#define AND( rA, rS, rB )               (rA) = (rS) & (rB);
#define AND_C( rA, rS, rB )             (rA) = (rS) & (rB); CR[0] = (long)(rA);
#define ANDC( rA, rS, rB )              (rA) = (rS) & ~(rB);
#define ANDC_C( rA, rS, rB )            (rA) = (rS) & ~(rB); CR[0] = (long)(rA);
#define ANDI_C( rA, rS, UIMM )          (rA) = (rS) & (UIMM); CR[0] = (long)(rA);
#define ANDIS_C( rA, rS, UIMM )         (rA) = (rS) & ((UIMM) << 16); CR[0] = (long)(rA);
#define BA( addr )                      goto (addr);
#define BCTR                            (*(void (*)(void))CTR)();
#define BEQ( label )                    if ( CR[0] == 0 ) goto label;
#define BEQ_PLUS( label )               BEQ( label )
#define BEQ_MINUS( label )              BEQ( label )
#define BEQ_CR( bit, label )            if ( CR[(bit)] == 0 ) goto label;
#define BEQ_CR_PLUS( bit, label )       BEQ_CR( bit, label )
#define BEQ_CR_MINUS( bit, label )      BEQ_CR( bit, label )
#define BEQLR                           if ( CR[0] == 0 ) return;
#define BEQLR_PLUS                      BEQLR
#define BEQLR_MINUS                     BEQLR
#define BEQLR_CR( bit )                 if ( CR[(bit)] == 0 ) return;
#define BEQLR_CR_PLUS( bit )            BEQLR_CR( bit )
#define BEQLR_CR_MINUS( bit )           BEQLR_CR( bit )
#define BGE( label )                    if ( CR[0] >= 0 ) goto label;
#define BGE_PLUS( label )               BGE( label )
#define BGE_MINUS( label )              BGE( label )
#define BGE_CR( bit, label )            if ( CR[(bit)] >= 0 ) goto label;
#define BGE_CR_PLUS( bit, label )       BGE_CR( bit, label )
#define BGE_CR_MINUS( bit, label )      BGE_CR( bit, label )
#define BGELR                           if ( CR[0] >= 0 ) return;
#define BGELR_PLUS                      BGELR
#define BGELR_MINUS                     BGELR
#define BGELR_CR( bit )                 if ( CR[(bit)] >= 0 ) return;
#define BGELR_CR_PLUS( bit )            BGELR_CR( bit )
#define BGELR_CR_MINUS( bit )           BGELR_CR( bit )
#define BGT( label )                    if ( CR[0] > 0 ) goto label;
#define BGT_PLUS( label )               BGT( label )
#define BGT_MINUS( label )              BGT( label )
#define BGT_CR( bit, label )            if ( CR[(bit)] > 0 ) goto label;
#define BGT_CR_PLUS( bit, label )       BGT_CR( bit, label )
#define BGT_CR_MINUS( bit, label )      BGT_CR( bit, label )
#define BGTLR                           if ( CR[0] > 0 ) return;
#define BGTLR_PLUS                      BGTLR
#define BGTLR_MINUS                     BGTLR
#define BGTLR_CR( bit )                 if ( CR[(bit)] > 0 ) return;
#define BGTLR_CR_PLUS( bit )            BGTLR_CR( bit )
#define BGTLR_CR_MINUS( bit )           BGTLR_CR( bit )
#define BL( func_name )                 func_name();
#define BLE( label )                    if ( CR[0] <= 0 ) goto label;
#define BLE_PLUS( label )               BLE( label )
#define BLE_MINUS( label )              BLE( label )
#define BLE_CR( bit, label )            if ( CR[(bit)] <= 0 ) goto label;
#define BLE_CR_PLUS( bit, label )       BLE_CR( bit, label )

#define BLE_CR_MINUS( bit, label )      BLE_CR( bit, label )
#define BLELR                           if ( CR[0] <= 0 ) return;
#define BLELR_PLUS                      BLELR
#define BLELR_MINUS                     BLELR
#define BLELR_CR( bit )                 if ( CR[(bit)] <= 0 ) return;
#define BLELR_CR_PLUS( bit )            BLELR_CR( bit )
#define BLELR_CR_MINUS( bit )           BLELR_CR( bit )
#define BLR                             return;
#define BLT( label )                    if ( CR[0] < 0 ) goto label;
#define BLT_PLUS( label )               BLT( label )
#define BLT_MINUS( label )              BLT( label )
#define BLT_CR( bit, label )            if ( CR[(bit)] < 0 ) goto label;
#define BLT_CR_PLUS( bit, label )       BLT_CR( bit, label )
#define BLT_CR_MINUS( bit, label )      BLT_CR( bit, label )
#define BLTLR                           if ( CR[0] < 0 ) return;
#define BLTLR_PLUS                      BLTLR
#define BLTLR_MINUS                     BLTLR
#define BLTLR_CR( bit )                 if ( CR[(bit)] < 0 ) return;
#define BLTLR_CR_PLUS( bit )            BLTLR_CR( bit )
#define BLTLR_CR_MINUS( bit )           BLTLR_CR( bit )
#define BNE( label )                    if ( CR[0] != 0 ) goto label;
#define BNE_PLUS( label )               BNE( label )
#define BNE_MINUS( label )              BNE( label )
#define BNE_CR( bit, label )            if ( CR[(bit)] != 0 ) goto label;
#define BNE_CR_PLUS( bit, label )       BNE_CR( bit, label )
#define BNE_CR_MINUS( bit, label )      BNE_CR( bit, label )
#define BNELR                           if ( CR[0] != 0 ) return;
#define BNELR_PLUS                      BNELR
#define BNELR_MINUS                     BNELR
#define BNELR_CR( bit )                 if ( CR[(bit)] != 0 ) return;
#define BNELR_CR_PLUS( bit )            BNELR_CR( bit )
#define BNELR_CR_MINUS( bit )           BNELR_CR( bit )
#define BR( label )                     goto label;
#define CLRLWI( rA, rS, nbits )         (rA) = (rS) & ((1 << (32-nbits)) - 1);
#define CLRLWI_C( rA, rS, nbits )       (rA) = (rS) & ((1 << (32-nbits)) - 1); \
                                        CR[0] = (long)(rA);
#define CLRRWI( rA, rS, nbits )         (rA) = (rS) & ~((1 << nbits) - 1);
#define CLRRWI_C( rA, rS, nbits )       (rA) = (rS) & ~((1 << nbits) - 1); \
                                        CR[0] = (long)(rA);
#define CMPLW( rA, rB )                 CR[0] = (((rA)^(rB)) & (1 << 31)) ? \
                                            ((rB) - (rA)) : ((rA) - (rB));
#define CMPLW_CR( bit, rA, rB )         CR[(bit)] = (((rA)^(rB)) & (1 << 31)) ? \
                                            ((rB) - (rA)) : ((rA) - (rB));
#define CMPLWI( rA, UIMM )              CR[0] = (((rA)^(UIMM)) & (1 << 31)) ? \
                                            ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPLWI_CR( bit, rA, UIMM )      CR[(bit)] = (((rA)^(UIMM)) & (1 << 31)) ? \
                                            ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPW( rA, rB )                  CR[0] = (rA) - (rB);
#define CMPW_CR( bit, rA, rB )          CR[(bit)] = (rA) - (rB);
#define CMPWI( rA, SIMM )               CR[0] = (rA) - (SIMM);
#define CMPWI_CR( bit, rA, SIMM )       CR[(bit)] = (rA) - (SIMM);
#define DCBF( rA, rB )
#define DCBI( rA, rB )
#define DCBST( rA, rB )
#define DCBT( rA, rB )
#define DCBTST( rA, rB )
#define DCBZ( rA, rB ) \
    *(long *)(((rA)+(rB)) & ~CACHE_LINE_MASK) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+4) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+8) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+12) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+16) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+20) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+24) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+28) = 0;
#define DECR( rD )                      --(rD);
#define DECR_C( rD )                    --(rD); CR[0] = (long)(rD);
#define DIVW( rD, rA, rB )              (rD) = (rA) / (rB);
#define DIVW_C( rD, rA, rB )            (rD) = (rA) / (rB); CR[0] = (long)(rD);
#define DIVWU( rD, rA, rB )             (rD) = (ulong)(rA) / (ulong)(rB);
#define DIVWU_C( rD, rA, rB )           (rD) = (ulong)(rA) / (ulong)(rB); \
                                        CR[0] = (long)(rD);
#define EQV( rA, rS, rB )               (rA) = ~((rS) ^ (rB));
#define EQV_C( rA, rS, rB )             (rA) = ~((rS) ^ (rB)); \
                                        CR[0] = (long)(rA);
#define FABS( frD, frB )                (frD) = ((frB) >= 0.0) ? (frB) : -(frB);
#define FADD( frD, frA, frB )           (frD) = (frA) + (frB);
#define FADDS( frD, frA, frB )          (frD) = (frA) + (frB);
#define FCMPO( bit, frA, frB ) \
{ \
    if ( (frA) < (frB) ) CR[(bit)] = -1; \
    else if ( (frA) > (frB) ) CR[(bit)] = 1; \
    else CR[(bit)] = 0; \
}


#define FCMPU( bit, frA, frB )          FCMPO( bit, frA, frB )
#define FCTIW( frD, frB )
#define FCTIWZ( frD, frB ) \
{ \
    union { \
        long i[2]; \
        double d; \
    } u; \
    u.i[0] = (long)(frB); \
    u.i[1] = 0; \
    (frD) = u.d; \
}
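
As an illustration of how the arithmetic, compare, and branch macros above combine, the following sketch restates a handful of them and uses them in a small loop. The compare macro stores a signed difference in `CR[0]`, and the branch macros test only its sign. The routine `sum_below` and its register assignments are hypothetical and are not part of the listing.

```c
#include <assert.h>

/* Subset of the listing's macros, restated with extraction damage repaired. */
#define ADD( rD, rA, rB )     (rD) = (rA) + (rB);
#define ADDI( rD, rA, SIMM )  (rD) = (rA) + (SIMM);
#define CMPW( rA, rB )        CR[0] = (rA) - (rB);
#define BGE( label )          if ( CR[0] >= 0 ) goto label;
#define BLT( label )          if ( CR[0] < 0 ) goto label;

/* Hypothetical routine: returns 0 + 1 + ... + (n - 1), or 0 when n <= 0. */
static long sum_below( long n )
{
    long CR[8];              /* emulated condition register fields */
    long r3 = n;             /* argument register */
    long r4 = 0;             /* loop index */
    long r5 = 0;             /* accumulator */

    CMPW( r4, r3 )
    BGE( done )              /* skip the loop when n <= 0 */
loop:
    ADD( r5, r5, r4 )
    ADDI( r4, r4, 1 )
    CMPW( r4, r3 )
    BLT( loop )
done:
    assert( n <= 0 || r4 == r3 );   /* the loop body ran exactly n times */
    return r5;
}
```

Because each macro carries its own trailing semicolon, invocations are written without one, which keeps the source close to assembler syntax.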

#define FDIV( frD, frA, frB )           (frD) = (frA) / (frB);
#define FDIVS( frD, frA, frB )          (frD) = (frA) / (frB);
#define FMADD( frD, frA, frC, frB )     (frD) = (frA) * (frC) + (frB);
#define FMADDS( frD, frA, frC, frB )    (frD) = (frA) * (frC) + (frB);
#define FMOV( frD, frB )                (frD) = (frB);
#define FMR( frD, frB )                 (frD) = (frB);
#define FMUL( frD, frA, frB )           (frD) = (frA) * (frB);
#define FMULS( frD, frA, frB )          (frD) = (frA) * (frB);
#define FMSUB( frD, frA, frC, frB )     (frD) = (frA) * (frC) - (frB);
#define FMSUBS( frD, frA, frC, frB )    (frD) = (frA) * (frC) - (frB);
#define FNABS( frD, frB )               (frD) = ((frB) >= 0.0) ? -(frB) : (frB);
#define FNEG( frD, frB )                (frD) = -(frB);
#define FNMADD( frD, frA, frC, frB )    (frD) = -((frA) * (frC) + (frB));
#define FNMADDS( frD, frA, frC, frB )   (frD) = -((frA) * (frC) + (frB));
#define FNMSUB( frD, frA, frC, frB )    (frD) = -((frA) * (frC) - (frB));
#define FNMSUBS( frD, frA, frC, frB )   (frD) = -((frA) * (frC) - (frB));
#define FRES( frD, frB )
#define FRSP( frD, frB )                (frD) = (float)(frB);
#define FRSQRTE( frD, frB )
#define FSEL( frD, frA, frC, frB )      (frD) = ((frA) >= 0.0) ? (frC) : (frB);
#define FSUB( frD, frA, frB )           (frD) = (frA) - (frB);
#define FSUBS( frD, frA, frB )          (frD) = (frA) - (frB);
#define GOTO( label )                   BR( label )
#define INCR( rD )                      ++(rD);
#define INCR_C( rD )                    ++(rD); CR[0] = (long)(rD);
#define LA( rD, symbol, SIMM )          (rD) = (long)&(symbol);
#define LABEL( label )                  label:
#define LBZ( rD, rA, d )                (rD) = *(uchar *)((rA) + (d));
#define LBZA( rD, symbol )              (rD) = *(uchar *)&(symbol);
#define LBZU( rD, rA, d )               (rD) = *(uchar *)((rA) += (d));
#define LBZUX( rD, rA, rB )             (rD) = *(uchar *)((rA) += (rB));
#define LBZX( rD, rA, rB )              (rD) = *(uchar *)((rA) + (rB));
#define LFD( frD, rA, d )               (frD) = *(double *)((rA) + (d));
#define LFDU( frD, rA, d )              (frD) = *(double *)((rA) += (d));
#define LFDUX( frD, rA, rB )            (frD) = *(double *)((rA) += (rB));
#define LFDX( frD, rA, rB )             (frD) = *(double *)((rA) + (rB));
#define LFS( frD, rA, d )               (frD) = *(float *)((rA) + (d));
#define LFSA( frD, symbol, rT )         (frD) = *(float *)&(symbol);
#define LFSU( frD, rA, d )              (frD) = *(float *)((rA) += (d));
#define LFSUX( frD, rA, rB )            (frD) = *(float *)((rA) += (rB));
#define LFSX( frD, rA, rB )             (frD) = *(float *)((rA) + (rB));
#define LHA( rD, rA, d )                (rD) = *(short *)((rA) + (d));
#define LHAA( rD, symbol )              (rD) = *(short *)&(symbol);
#define LHAU( rD, rA, d )               (rD) = *(short *)((rA) += (d));
#define LHAUX( rD, rA, rB )             (rD) = *(short *)((rA) += (rB));
#define LHAX( rD, rA, rB )              (rD) = *(short *)((rA) + (rB));
#define LHZ( rD, rA, d )                (rD) = *(ushort *)((rA) + (d));
#define LHZA( rD, symbol )              (rD) = *(ushort *)&(symbol);
#define LHZU( rD, rA, d )               (rD) = *(ushort *)((rA) += (d));
#define LHZUX( rD, rA, rB )             (rD) = *(ushort *)((rA) += (rB));
#define LHZX( rD, rA, rB )              (rD) = *(ushort *)((rA) + (rB));
#define LI( rD, SIMM )                  (rD) = (SIMM);
#define LIS( rD, SIMM )                 (rD) = ((SIMM) << 16);
#define LOAD_COUNT( rD )                CTR = (rD);
#define LWZ( rD, rA, d )                (rD) = *(long *)((rA) + (d));
#define LWZA( rD, symbol )              (rD) = *(long *)&(symbol);
#define LWZU( rD, rA, d )               (rD) = *(long *)((rA) += (d));
#define LWZUX( rD, rA, rB )             (rD) = *(long *)((rA) += (rB));
#define LWZX( rD, rA, rB )              (rD) = *(long *)((rA) + (rB));
#define MCRF( crfD, crfS )
#define MCRFS( crfD, crfS )
#define MFCR( rD )
#define MFCTR( rD )
#define MFLR( rD )
#define MFSPR( rD, SPR )
#define MOV( rA, rS )                   (rA) = (rS);
#define MOV_C( rA, rS )                 (rA) = (rS); CR[0] = (long)(rA);
#define MR( rA, rS )                    (rA) = (rS);
#define MR_C( rA, rS )                  (rA) = (rS); CR[0] = (long)(rA);
#define MTCR( rD )
#define MTCTR( rD )
#define MTFSFI( crfD, IMM )
#define MTLR( rD )
#define MTSPR( SPR, rS )
#define MULLI( rD, rA, SIMM )           (rD) = (rA) * (SIMM);
#define MULLW( rD, rA, rB )             (rD) = (rA) * (rB);
#define MULLW_C( rD, rA, rB )           (rD) = (rA) * (rB); CR[0] = (long)(rD);
#define NAND( rA, rS, rB )              (rA) = ~((rS) & (rB));
#define NAND_C( rA, rS, rB )            (rA) = ~((rS) & (rB)); CR[0] = (long)(rA);
#define NEG( rD, rA )                   (rD) = -(rA);
#define NEG_C( rD, rA )                 (rD) = -(rA); CR[0] = (long)(rD);
#define NOP
#define NOR( rA, rS, rB )               (rA) = ~((rS) | (rB));
#define NOR_C( rA, rS, rB )             (rA) = ~((rS) | (rB)); CR[0] = (long)(rA);
#define OR( rA, rS, rB )                (rA) = (rS) | (rB);
#define OR_C( rA, rS, rB )              (rA) = (rS) | (rB); CR[0] = (long)(rA);
#define ORC( rA, rS, rB )               (rA) = (rS) | ~(rB);
#define ORC_C( rA, rS, rB )             (rA) = (rS) | ~(rB); CR[0] = (long)(rA);



#define ORI( rA, rS, UIMM )             (rA) = (rS) | (UIMM);
#define ORIS( rA, rS, UIMM )            (rA) = (rS) | ((UIMM) << 16);
#define RETURN                          BLR
#define RLWIMI( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
}
#define RLWIMI_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
    CR[0] = (long)(rA); \
}
#define RLWINM( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
}
#define RLWINM_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
    CR[0] = (long)(rA); \
}
#define RLWNM( rA, rS, rB, MB, ME )     RLWINM( rA, rS, (rB) & 0x1f, MB, ME )
#define RLWNM_C( rA, rS, rB, MB, ME )   RLWINM_C( rA, rS, (rB) & 0x1f, MB, ME )
#define EXTLWI( rA, rS, n, b )          RLWINM( rA, rS, (b), 0, (n)-1 )
#define EXTLWI_C( rA, rS, n, b )        RLWINM_C( rA, rS, (b), 0, (n)-1 )
#define EXTRWI( rA, rS, n, b )          RLWINM( rA, rS, (b)+(n), 32-(n), 31 )
#define EXTRWI_C( rA, rS, n, b )        RLWINM_C( rA, rS, (b)+(n), 32-(n), 31 )
#define INSLWI( rA, rS, n, b )          RLWIMI( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSLWI_C( rA, rS, n, b )        RLWIMI_C( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSRWI( rA, rS, n, b )          RLWIMI( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define INSRWI_C( rA, rS, n, b )        RLWIMI_C( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define ROTLW( rA, rS, rB )             RLWNM( rA, rS, rB, 0, 31 )
#define ROTLW_C( rA, rS, rB )           RLWNM_C( rA, rS, rB, 0, 31 )
#define ROTLWI( rA, rS, n )             RLWINM( rA, rS, (n), 0, 31 )
#define ROTLWI_C( rA, rS, n )           RLWINM_C( rA, rS, (n), 0, 31 )
#define ROTRWI( rA, rS, n )             RLWINM( rA, rS, 32-(n), 0, 31 )
#define ROTRWI_C( rA, rS, n )           RLWINM_C( rA, rS, 32-(n), 0, 31 )
#define SLW( rA, rS, rB )               (rA) = (rS) << (rB);
#define SLW_C( rA, rS, rB )             (rA) = (rS) << (rB); CR[0] = (long)(rA);
#define SLWI( rA, rS, SH )              (rA) = (rS) << (SH);
#define SLWI_C( rA, rS, SH )            (rA) = (rS) << (SH); CR[0] = (long)(rA);
#define SRAW( rA, rS, rB )              (rA) = (long)(rS) >> (rB);
#define SRAW_C( rA, rS, rB )            (rA) = (long)(rS) >> (rB); CR[0] = (long)(rA);
#define SRAWI( rA, rS, SH )             (rA) = (long)(rS) >> (SH);
#define SRAWI_C( rA, rS, SH )           (rA) = (long)(rS) >> (SH); CR[0] = (long)(rA);
#define SRW( rA, rS, rB )               (rA) = (ulong)(rS) >> (rB);
#define SRW_C( rA, rS, rB )             (rA) = (ulong)(rS) >> (rB); CR[0] = (long)(rA);
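
The rotate-and-mask macros above all reduce to one operation: rotate the source word left by SH, then keep bits MB through ME, counting bit 0 as the most significant bit. The sketch below expresses that operation as C functions, assuming 32-bit words. The function names are hypothetical, and the guards for a zero shift and for a full-width mask are additions of this sketch, not part of the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit word left by sh (taken modulo 32). */
static uint32_t rotl32( uint32_t x, unsigned sh )
{
    sh &= 31;
    return sh ? (x << sh) | (x >> (32 - sh)) : x;
}

/* Ones in PowerPC bit positions mb..me, where bit 0 is the MSB.
 * Computed in 64 bits so that a full-width mask (mb = 0, me = 31) is defined. */
static uint32_t ppc_mask( unsigned mb, unsigned me )
{
    assert( mb <= me && me <= 31 );
    uint64_t ones = (UINT64_C(1) << (me - mb + 1)) - 1;
    return (uint32_t)(ones << (31 - me));
}

/* Rotate left, then AND with mask: the core of RLWINM. */
static uint32_t rlwinm( uint32_t rs, unsigned sh, unsigned mb, unsigned me )
{
    return rotl32( rs, sh ) & ppc_mask( mb, me );
}

/* Extract n bits starting at bit b, left-justified (EXTLWI). */
static uint32_t extlwi( uint32_t rs, unsigned n, unsigned b )
{
    return rlwinm( rs, b, 0, n - 1 );
}

/* Extract n bits starting at bit b, right-justified (EXTRWI). */
static uint32_t extrwi( uint32_t rs, unsigned n, unsigned b )
{
    return rlwinm( rs, b + n, 32 - n, 31 );
}

/* Rotate right by n, expressed as a left rotate by 32 - n (ROTRWI). */
static uint32_t rotrwi( uint32_t rs, unsigned n )
{
    return rlwinm( rs, 32 - n, 0, 31 );
}
```

With the word 0x12345678, extracting the top 8 bits right-justified gives 0x12, and rotating right by 8 gives 0x78123456.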

#define SRWI( rA, rS, SH )              (rA) = (ulong)(rS) >> (SH);
#define SRWI_C( rA, rS, SH )            (rA) = (ulong)(rS) >> (SH); CR[0] = (long)(rA);
#define STB( rS, rA, d )                *(char *)((rA) + (d)) = (rS);
#define STBU( rS, rA, d )               *(char *)((rA) += (d)) = (rS);
#define STBUX( rS, rA, rB )             *(char *)((rA) += (rB)) = (rS);
#define STBX( rS, rA, rB )              *(char *)((rA) + (rB)) = (rS);
#define STFD( frD, rA, d )              *(double *)((rA) + (d)) = (frD);
#define STFDU( frD, rA, d )             *(double *)((rA) += (d)) = (frD);
#define STFDUX( frD, rA, rB )           *(double *)((rA) += (rB)) = (frD);
#define STFDX( frD, rA, rB )            *(double *)((rA) + (rB)) = (frD);
#define STFS( frD, rA, d )              *(float *)((rA) + (d)) = (frD);
#define STFSU( frD, rA, d )             *(float *)((rA) += (d)) = (frD);
#define STFSUX( frD, rA, rB )           *(float *)((rA) += (rB)) = (frD);
#define STFSX( frD, rA, rB )            *(float *)((rA) + (rB)) = (frD);
#define STH( rS, rA, d )                *(short *)((rA) + (d)) = (rS);
#define STHU( rS, rA, d )               *(short *)((rA) += (d)) = (rS);
#define STHUX( rS, rA, rB )             *(short *)((rA) += (rB)) = (rS);
#define STHX( rS, rA, rB )              *(short *)((rA) + (rB)) = (rS);
#define STW( rS, rA, d )                *(long *)((rA) + (d)) = (rS);
#define STWU( rS, rA, d )               *(long *)((rA) += (d)) = (rS);
#define STWUX( rS, rA, rB )             *(long *)((rA) += (rB)) = (rS);
#define STWX( rS, rA, rB )              *(long *)((rA) + (rB)) = (rS);
#define SUB( rD, rA, rB )               (rD) = (rA) - (rB);
#define SUB_C( rD, rA, rB )             (rD) = (rA) - (rB); CR[0] = (long)(rD);
#define SUBFIC( rD, rA, SIMM )          (rD) = (SIMM) - (rA);
#define SUBI( rD, rA, SIMM )            (rD) = (rA) - (SIMM);
#define SUBIC_C( rD, rA, SIMM )         (rD) = (rA) - (SIMM); CR[0] = (long)(rD);
#define SUBIS( rD, rA, SIMM )           (rD) = (rA) - ((SIMM) << 16);
#define TEST_COUNT( label )             if ( --CTR ) goto label;
#define XOR( rA, rS, rB )               (rA) = (rS) ^ (rB);
#define XOR_C( rA, rS, rB )             (rA) = (rS) ^ (rB); CR[0] = (long)(rA);
#define XORI( rA, rS, UIMM )            (rA) = (rS) ^ (UIMM);
#define XORIS( rA, rS, UIMM )           (rA) = (rS) ^ ((UIMM) << 16);

#if defined( BUILD_MAX )

/*
 * VMX instructions
 */
#define BR_VMX_ALL_TRUE( label )        if ( CR[6] & 0x8 ) goto label;
#define BR_VMX_ALL_FALSE( label )       if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_NONE_TRUE( label )       if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_SOME_FALSE( label )      if ( !(CR[6] & 0x8) ) goto label;
#define BR_VMX_SOME_TRUE( label )       if ( !(CR[6] & 0x2) ) goto label;
#define DSS( STRM )
#define DSSALL
#define DST( rA, rB, STRM )
#define DSTT( rA, rB, STRM )
#define DSTST( rA, rB, STRM )
#define DSTSTT( rA, rB, STRM )

#if defined( COMPILE_NON_ALIGNED )
#define VMX_ADDR_MASK 0
#else
#define VMX_ADDR_MASK 15
#endif

#if defined( COMPILE_LVX_CHARS )

#define LVX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[C_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
    (vT).c[C_INDEX_MUNGE( i + 2 )] = addr[2]; \
    (vT).c[C_INDEX_MUNGE( i + 3 )] = addr[3]; \
}

#elif defined( COMPILE_LVX_SHORTS )

#define LVX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[S_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
    (vT).s[S_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#else

#define LVX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(( (ulong)(rA) + (ulong)(rB) ) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[L_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    (vT).l[L_INDEX_MUNGE( i )] = addr[0]; \
}

#endif 
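
The element loads above all follow one pattern: form the effective address, clear its low bits to the element size, and derive the destination lane from the address bits inside the 16-byte quadword. The sketch below illustrates that pattern for a 32-bit element. The names `demo_vec`, `demo_word_lane` and `demo_lvewx` are hypothetical and not part of the listing, and the index-munging step is assumed to be the identity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the listing's vector register type. */
typedef union {
    uint8_t  uc[16];
    uint32_t ul[4];
} demo_vec;

#define DEMO_ADDR_MASK 15u  /* low four address bits select the byte in the quadword */

/* Word lane addressed by an effective address: (EA & 15) >> 2. */
static unsigned demo_word_lane(uintptr_t ea)
{
    return (unsigned)((ea & DEMO_ADDR_MASK) >> 2);
}

/* Load one 32-bit element into its lane; the other lanes are left untouched. */
static void demo_lvewx(demo_vec *vt, const void *base, uintptr_t offset)
{
    uintptr_t ea = ((uintptr_t)base + offset) & ~(uintptr_t)3; /* word-align */
    uint32_t word;

    memcpy(&word, (const void *)ea, sizeof word);
    vt->ul[demo_word_lane(ea)] = word;
}
```

Only the addressed lane changes, which is why callers typically combine several element loads or follow them with a permute.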

#if defined( COMPILE_STVX_CHARS )

#define STVX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(( (ulong)(rA) + (ulong)(rB) ) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        addr[i] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
    addr[2] = (vS).c[C_INDEX_MUNGE( i + 2 )]; \
    addr[3] = (vS).c[C_INDEX_MUNGE( i + 3 )]; \
}

#elif defined( COMPILE_STVX_SHORTS )

#define STVX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        addr[i] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
    addr[1] = (vS).s[S_INDEX_MUNGE( i + 1 )]; \
}

#else

#define STVX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(( (ulong)(rA) + (ulong)(rB) ) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        addr[i] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)( (ulong)(rA) + (ulong)(rB) ); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(( (ulong)(rA) + (ulong)(rB) ) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(( (ulong)(rA) + (ulong)(rB) ) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    addr[0] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#endif 

#define LVSL_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = ( (ulong)(rA) + (ulong)(rB) ) & VMX_ADDR_MASK; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#define LVSR_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = 16 - (( (ulong)(rA) + (ulong)(rB) ) & VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )  LVSR_BE( vT, rA, rB );
#define LVSR( vT, rA, rB )  LVSL_BE( vT, rA, rB );
#else
#define LVSL( vT, rA, rB )  LVSL_BE( vT, rA, rB );
#define LVSR( vT, rA, rB )  LVSR_BE( vT, rA, rB );
#endif

#define LVXL( vT, rA, rB )  LVX( vT, rA, rB )

#define STVXL( vS, rA, rB ) STVX( vS, rA, rB )

#define VADDFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a + b; \
        (vT).f[i] = c; \
    } \
}

#define VADDSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] + (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}

#define VADDSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] + (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

/* Two negative words can wrap to exactly zero, so the negative-overflow */
/* test uses >= 0 rather than > 0. */
#define VADDSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] + (vB).l[i]; \
        if ( ((vA).l[i] > 0) && ((vB).l[i] > 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] < 0) && (itemp >= 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VADDUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] + (vB).uc[i]; \
}

#define VADDUBS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (ulong)(vA).uc[i] + (ulong)(vB).uc[i]; \
        if ( itemp > 255 ) (vT).uc[i] = 255; \
        else (vT).uc[i] = (uchar)itemp; \
    } \
}

#define VADDUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] + (vB).us[i]; \
}

/* Results are halfwords, so they are stored through .us (not .uc). */
#define VADDUHS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (ulong)(vA).us[i] + (ulong)(vB).us[i]; \
        if ( itemp > 65535 ) (vT).us[i] = 65535; \
        else (vT).us[i] = (ushort)itemp; \
    } \
}

#define VADDUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] + (vB).ul[i]; \
}

#define VADDUWS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).ul[i] + (vB).ul[i]; \
        if ( itemp < (vA).ul[i] ) (vT).ul[i] = (ulong)0xffffffff; \
        else (vT).ul[i] = itemp; \
    } \
} 
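
The saturating adds above use two different overflow tests: the narrow element types are widened before adding and then clamped, while the full-width unsigned word add detects wraparound by comparing the sum with one operand. The following is a minimal sketch of both tests as standalone functions; the function names are illustrative only and fixed-width types are assumed in place of the listing's `long` and `ulong`.

```c
#include <assert.h>
#include <stdint.h>

/* Widen, add, then clamp to the signed 8-bit range. */
static int8_t demo_add_sat_s8(int8_t a, int8_t b)
{
    int32_t t = (int32_t)a + (int32_t)b;

    if (t < -128) return -128;
    if (t > 127)  return 127;
    return (int8_t)t;
}

/* Unsigned 32-bit add: a wrapped sum is always smaller than either operand. */
static uint32_t demo_add_sat_u32(uint32_t a, uint32_t b)
{
    uint32_t t = a + b;  /* wraps modulo 2^32 */

    return (t < a) ? 0xffffffffu : t;
}
```

The widening form is simpler but needs a type wider than the element; the comparison form works when no wider type is available.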
#define VAND( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & (vB).ul[i]; \
}

#define VANDC( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & ~(vB).ul[i]; \
}

#define VCMPEQFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
}

/* The _C forms also summarize the result in CR[6]: */
/* 0x8 if every element compared true, 0x2 if none did, 0 otherwise. */
#define VCMPEQFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPEQUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPEQUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
} 
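
The record-form compares above reduce the per-element masks to a three-way summary by AND-ing and OR-ing them together. As an illustration, the sketch below expresses that reduction as a function that returns the value the macros store in `CR[6]`. The function name and the use of plain arrays instead of the listing's vector register type are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise equality on four words.
 * Returns 0x8 if all elements match, 0x2 if none match, 0 if mixed. */
static unsigned demo_cmpequw_c(uint32_t out[4],
                               const uint32_t a[4],
                               const uint32_t b[4])
{
    uint32_t all = 0xffffffffu;  /* stays all-ones only if every mask is set */
    uint32_t any = 0;            /* becomes non-zero if any mask is set */
    int i;

    for (i = 0; i < 4; i++) {
        out[i] = (a[i] == b[i]) ? 0xffffffffu : 0;
        all &= out[i];
        any |= out[i];
    }
    if (all)  return 0x8;
    if (!any) return 0x2;
    return 0;
}
```

A caller can then branch on the summary in the same way the `BR_VMX_*` macros test `CR[6]`.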

#define VCMPGEFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGEFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
}

#define VCMPGTSB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
}

#define VCMPGTSH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTSW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}
#define VCMPGTUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPGTUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPGTUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCFSX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).l[i]) / fj; \
}

#define VCFUX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).ul[i]) / fj; \
}

#define VCTSXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i; \
    long l; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= -max ) l = 0x80000000; \
        else if ( g >= max ) l = 0x7fffffff; \
        else l = (long)f << ((UIMM) & 0x1f); \
        (vT).l[i] = l; \
    } \
}

#define VCTUXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i, ul; \
    i = (127 + 32) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= 0 ) ul = 0; \
        else if ( g >= max ) ul = 0xffffffff; \
        else ul = (ulong)f << ((UIMM) & 0x1f); \
        (vT).ul[i] = ul; \
    } \
}

/* 2^x computed as exp( ln(2) * x ) */
#define VEXPTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = exp( 0.693147180559945 * (vB).f[i] ); \
}

/* log2(x) computed as log(x) / ln(2) */
#define VLOGEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.442695040888963 * log( (vB).f[i] ); \
}

#define VMADDFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b + d; \
        (vT).f[i] = d; \
    } \
}

#define VMAXFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] >= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMAXSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] >= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMAXSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] >= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMAXSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] >= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMAXUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] >= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMAXUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] >= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMAXUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] >= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMHADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}

#define VMHRADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a += 0x00004000; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}

#define VMINFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] <= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMINSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] <= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

/* Loop bounds below match the element count of each type */
/* (8 halfwords, 4 words per 16-byte register). */
#define VMINSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] <= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMINSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] <= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMINUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] <= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMINUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] <= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMINUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] <= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMLADDUHM( vD, vA, vB, vC ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).us[i]; \
        b = (ulong)(vB).us[i]; \
        c = (ulong)(vC).us[i]; \
        c += (a * b); \
        (vD).us[i] = (ushort)c; \
    } \
}

#define VMR( vD, vS ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vD).ul[i] = (vS).ul[i]; \
}



#define VMRGHB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[i]; \
        v.uc[(j+1)] = (vB).uc[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[i]; \
        v.us[(j+1)] = (vB).us[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[i]; \
        v.ul[(j+1)] = (vB).ul[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[(8+i)]; \
        v.uc[(j+1)] = (vB).uc[(8+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[(4+i)]; \
        v.us[(j+1)] = (vB).us[(4+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[(2+i)]; \
        v.ul[(j+1)] = (vB).ul[(2+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )

#define VMRGHB( vT, vA, vB )    VMRGLB_BE( vT, vB, vA )
#define VMRGHH( vT, vA, vB )    VMRGLH_BE( vT, vB, vA )
#define VMRGHW( vT, vA, vB )    VMRGLW_BE( vT, vB, vA )
#define VMRGLB( vT, vA, vB )    VMRGHB_BE( vT, vB, vA )
#define VMRGLH( vT, vA, vB )    VMRGHH_BE( vT, vB, vA )
#define VMRGLW( vT, vA, vB )    VMRGHW_BE( vT, vB, vA )

#else

#define VMRGHB( vT, vA, vB )    VMRGHB_BE( vT, vA, vB )
#define VMRGHH( vT, vA, vB )    VMRGHH_BE( vT, vA, vB )
#define VMRGHW( vT, vA, vB )    VMRGHW_BE( vT, vA, vB )
#define VMRGLB( vT, vA, vB )    VMRGLB_BE( vT, vA, vB )
#define VMRGLH( vT, vA, vB )    VMRGLH_BE( vT, vA, vB )
#define VMRGLW( vT, vA, vB )    VMRGLW_BE( vT, vA, vB )



#endif 
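
The merge macros above interleave elements from one half of each source register, building the result in a temporary so that the target may be the same register as a source. A minimal sketch of the big-endian "merge high bytes" case follows. The function name and the use of plain byte arrays in place of the listing's register type are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interleave the first eight bytes of a and b: a0 b0 a1 b1 ... a7 b7. */
static void demo_merge_high_bytes(uint8_t out[16],
                                  const uint8_t a[16],
                                  const uint8_t b[16])
{
    uint8_t tmp[16];  /* temporary, so out may alias a or b */
    int i;

    for (i = 0; i < 8; i++) {
        tmp[2 * i]     = a[i];
        tmp[2 * i + 1] = b[i];
    }
    memcpy(out, tmp, sizeof tmp);
}
```

The "merge low" variants differ only in reading elements from the second half of each source.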



#define VMSUMMBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, c; \
    ulong b; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (long)(vA).c[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

/* The halfword forms sum two halfwords per word, so the element */
/* index is 2*i+j (4*i+j would run past the eight halfwords). */
#define VMSUMSHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; \
            b = (long)(vB).s[2*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; \
            b = (long)(vB).s[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 2147483647.0 ) c = 2147483647.0; \
        else if ( c <= -2147483648.0 ) c = -2147483648.0; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMSUMUBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (ulong)(vA).uc[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; \
            b = (ulong)(vB).us[2*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; \
            b = (ulong)(vB).us[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 4294967295.0 ) c = 4294967295.0; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULESB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i]; \
        b = (long)(vB).c[2*i]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULESH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i]; \
        b = (long)(vB).s[2*i]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULEUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i]; \
        b = (ulong)(vB).uc[2*i]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULEUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i]; \
        b = (ulong)(vB).us[2*i]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULOSB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i+1]; \
        b = (long)(vB).c[2*i+1]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULOSH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i+1]; \
        b = (long)(vB).s[2*i+1]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULOUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i+1]; \
        b = (ulong)(vB).uc[2*i+1]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULOUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i+1]; \
        b = (ulong)(vB).us[2*i+1]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VNMSUBFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b - d; \
        (vT).f[i] = d; \
    } \
}

#define VNOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ~((vA).ul[i] | (vB).ul[i]); \
}

#define VNOT( vT, vA ) VNOR( vT, vA, vA )

#define VOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] | (vB).ul[i]; \
}

#define VPERM_BE( vT, vA, vB, vC ) \
{ \
    VMX_reg v; \
    ulong field, i; \
    for ( i = 0; i < 16; i++ ) { \
        field = (vC).uc[i]; \
        v.uc[i] = ( field < 16 ) ? (vA).uc[field] : (vB).uc[field - 16]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j^1)] ? (uchar)255 : (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j^1)] ? (uchar)255 : (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= 0 ) v.uc[i] = 0; \
        else if ( (vA).s[i] >= 255 ) v.uc[i] = 255; \
        else v.uc[i] = (vA).uc[j]; \
        if ( (vB).s[i] <= 0 ) v.uc[i+8] = 0; \
        else if ( (vB).s[i] >= 255 ) v.uc[i+8] = 255; \
        else v.uc[i+8] = (vB).uc[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= -128 ) v.c[i] = -128; \
        else if ( (vA).s[i] >= 127 ) v.c[i] = 127; \
        else v.c[i] = (vA).c[j]; \
        if ( (vB).s[i] <= -128 ) v.c[i+8] = -128; \
        else if ( (vB).s[i] >= 127 ) v.c[i+8] = 127; \
        else v.c[i+8] = (vB).c[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j^1)] ? (ushort)65535 : (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j^1)] ? (ushort)65535 : (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= 0 ) v.us[i] = 0; \
        else if ( (vA).l[i] >= 65535 ) v.us[i] = 65535; \
        else v.us[i] = (vA).us[j]; \
        if ( (vB).l[i] <= 0 ) v.us[i+4] = 0; \
        else if ( (vB).l[i] >= 65535 ) v.us[i+4] = 65535; \
        else v.us[i+4] = (vB).us[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

/* Four words per source pack into eight halfwords, so the loop runs */
/* four times and the vB results start at element 4. */
#define VPKSWSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= -32768 ) v.s[i] = -32768; \
        else if ( (vA).l[i] >= 32767 ) v.s[i] = 32767; \
        else v.s[i] = (vA).s[j]; \
        if ( (vB).l[i] <= -32768 ) v.s[i+4] = -32768; \
        else if ( (vB).l[i] >= 32767 ) v.s[i+4] = 32767; \
        else v.s[i+4] = (vB).s[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )   VPERM_BE( vT, vB, vA, vC );
#define VPKUHUM( vT, vA, vB )     VPKUHUM_BE( vT, vB, vA, 0 )
#define VPKUHUS( vT, vA, vB )     VPKUHUS_BE( vT, vB, vA, 0 )
#define VPKSHUS( vT, vA, vB )     VPKSHUS_BE( vT, vB, vA, 0 )
#define VPKSHSS( vT, vA, vB )     VPKSHSS_BE( vT, vB, vA, 0 )
#define VPKUWUM( vT, vA, vB )     VPKUWUM_BE( vT, vB, vA, 0 )
#define VPKUWUS( vT, vA, vB )     VPKUWUS_BE( vT, vB, vA, 0 )
#define VPKSWUS( vT, vA, vB )     VPKSWUS_BE( vT, vB, vA, 0 )
#define VPKSWSS( vT, vA, vB )     VPKSWSS_BE( vT, vB, vA, 0 )
#else
#define VPERM( vT, vA, vB, vC )   VPERM_BE( vT, vA, vB, vC );
#define VPKUHUM( vT, vA, vB )     VPKUHUM_BE( vT, vA, vB, 1 )
#define VPKUHUS( vT, vA, vB )     VPKUHUS_BE( vT, vA, vB, 1 )
#define VPKSHUS( vT, vA, vB )     VPKSHUS_BE( vT, vA, vB, 1 )
#define VPKSHSS( vT, vA, vB )     VPKSHSS_BE( vT, vA, vB, 1 )
#define VPKUWUM( vT, vA, vB )     VPKUWUM_BE( vT, vA, vB, 1 )
#define VPKUWUS( vT, vA, vB )     VPKUWUS_BE( vT, vA, vB, 1 )
#define VPKSWUS( vT, vA, vB )     VPKSWUS_BE( vT, vA, vB, 1 )
#define VPKSWSS( vT, vA, vB )     VPKSWSS_BE( vT, vA, vB, 1 )
#endif
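/*
 * Illustrative sketch only (not part of the original header): the saturating
 * pack macros above clamp each wide element to the range of the narrower
 * result type before storing it. The helper names pack_sw_to_us,
 * pack_sw_to_ss and pack_swss are hypothetical, and fixed-width <stdint.h>
 * types stand in for the header's 32-bit long and 16-bit short.
 */

```c
#include <stdint.h>

/* Clamp a signed 32-bit word to the unsigned 16-bit range. */
static uint16_t pack_sw_to_us(int32_t x)
{
    if (x <= 0) return 0;
    if (x >= 65535) return 65535;
    return (uint16_t)x;
}

/* Clamp a signed 32-bit word to the signed 16-bit range. */
static int16_t pack_sw_to_ss(int32_t x)
{
    if (x <= -32768) return -32768;
    if (x >= 32767) return 32767;
    return (int16_t)x;
}

/* Pack four words from a, then four words from b, into eight halfwords. */
static void pack_swss(int16_t out[8], const int32_t a[4], const int32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = pack_sw_to_ss(a[i]);
        out[i + 4] = pack_sw_to_ss(b[i]);
    }
}
```

/*
 * Clamping by value, as here, avoids any dependence on which half of the
 * word holds the low-order bits, which the macros handle through the
 * `base` argument.
 */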



#define VREFP( vT, vB ) \
{ \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / (vB).f[i]; \
}

#define VRFIM( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r > f ) --r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}
#define VRFIN( vT, vB ) \
{ \
    float f, r, s; \
    ulong i; \
    long lr; \
    for ( i = 0; i < 4; i++ ) { \
        s = f = (vB).f[i]; \
        if ( f < 0.0 ) f = -f; \
        r = f + 0.5; \
        if ( r != f ) { \
            lr = (long)r; \
            f = (float)lr; \
            if ( f == r ) f = (float)(lr & ~1); \
        } \
        if ( s < 0.0 ) f = -f; \
        (vT).f[i] = f; \
    } \
}

#define VRFIP( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r < f ) ++r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}
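/*
 * Illustrative sketch only (not part of the original header): VRFIM and
 * VRFIP above round by truncating to an integer and then adjusting by one
 * when truncation moved the value the wrong way, leaving values whose
 * magnitude is 2^31 or more untouched. The helper names floor_by_trunc and
 * ceil_by_trunc are hypothetical, and int32_t stands in for the 32-bit
 * `long` the header appears to assume.
 */

```c
#include <stdint.h>

/* Round toward minus infinity using truncation plus a correction step. */
static float floor_by_trunc(float f)
{
    const float max = 2147483648.0f;      /* 2^31, same bound as the macros */
    if (f >= -max && f < max) {
        float r = (float)(int32_t)f;      /* truncates toward zero */
        if (r > f) r -= 1.0f;             /* negative non-integers need one step down */
        f = r;
    }
    return f;                             /* large values and NaN pass through */
}

/* Round toward plus infinity using truncation plus a correction step. */
static float ceil_by_trunc(float f)
{
    const float max = 2147483648.0f;
    if (f >= -max && f < max) {
        float r = (float)(int32_t)f;
        if (r < f) r += 1.0f;             /* positive non-integers need one step up */
        f = r;
    }
    return f;
}
```

/*
 * The range check matters because converting an out-of-range float to an
 * integer is undefined in C; floats that large are already integral.
 */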

#define VRFIZ( vT, vB ) \
{ \
    float f, max; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) \
            f = (float)((long)f); \
        (vT).f[i] = f; \
    } \
}

#define VRLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = ((vA).uc[i] << sh) | ((vA).uc[i] >> (8-sh)); \
    } \
}

#define VRLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = ((vA).us[i] << sh) | ((vA).us[i] >> (16-sh)); \
    } \
}

#define VRSQRTEFP( vT, vB ) \
{ \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / sqrt( (vB).f[i] ); \
}

#define VRLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = ((vA).ul[i] << sh) | ((vA).ul[i] >> (32-sh)); \
    } \
}

#define VSEL( vT, vA, vB, vC ) \
{ \
    ulong atemp, btemp, i; \
    for ( i = 0; i < 4; i++ ) { \
        /* mask bit 0 selects vA, mask bit 1 selects vB */ \
        atemp = (vA).ul[i] & ~(vC).ul[i]; \
        btemp = (vB).ul[i] & (vC).ul[i]; \
        (vT).ul[i] = atemp | btemp; \
    } \
}
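/*
 * Illustrative sketch only (not part of the original header): the rotate
 * macros above combine a left shift with a complementary right shift. When
 * the count is zero the complementary shift equals the full element width,
 * which C leaves undefined, so this hypothetical helper rotl32 masks both
 * shift counts. uint32_t stands in for the header's `ulong`.
 */

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (only the low five bits of sh count). */
static uint32_t rotl32(uint32_t x, uint32_t sh)
{
    sh &= 0x1f;
    /* (32 - sh) & 31 maps sh == 0 to a right shift of 0 instead of 32 */
    return (x << sh) | (x >> ((32u - sh) & 0x1f));
}
```

/*
 * For example, rotl32(0x80000001, 1) wraps the top bit around to give
 * 0x00000003, and a count of 36 behaves like a count of 4.
 */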

#define VSL( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[0] = ((vA).ul[0] << sh) | ((vA).ul[1] >> (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] << sh) | ((vA).ul[2] >> (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] << sh) | ((vA).ul[3] >> (32-sh)); \
    (vT).ul[3] = (vA).ul[3] << sh; \
}

#define VSLDOI( vT, vA, vB, UIMM ) \
{ \
    VMX_reg v; \
    ulong sh; \
    sh = (UIMM) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        v.uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        v.uc[j] = (vB).uc[j-i]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VSLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] << sh; \
    } \
}

#define VSLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] << sh; \
    } \
}

#define VSLO( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        (vT).uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        (vT).uc[j] = 0; \
}

#define VSLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] << sh; \
    } \
}
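
/*
 * Illustrative sketch only (not part of the original header): VSL above
 * shifts a 128-bit value left by 0 to 7 bits by shifting each 32-bit word
 * and carrying in the top bits of the next word. The hypothetical helper
 * shl128_bits does the same with an explicit carry and skips the carry when
 * the count is zero, so no shift by the full word width occurs. Element 0
 * is taken to be the most significant word, as in the macro.
 */

```c
#include <stdint.h>

/* Shift a 128-bit value, held as four words with in[0] most significant, left by sh bits. */
static void shl128_bits(uint32_t out[4], const uint32_t in[4], unsigned sh)
{
    sh &= 0x7;
    for (int i = 0; i < 4; i++) {
        uint32_t carry = 0;
        if (sh != 0 && i < 3)
            carry = in[i + 1] >> (32 - sh);   /* bits arriving from the next word */
        out[i] = (in[i] << sh) | carry;
    }
}
```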



#define VSR( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[3] = ((vA).ul[3] >> sh) | ((vA).ul[2] << (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] >> sh) | ((vA).ul[1] << (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] >> sh) | ((vA).ul[0] << (32-sh)); \
    (vT).ul[0] = (vA).ul[0] >> sh; \
}

#define VSRAB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).c[i] = (vA).c[i] >> sh; \
    } \
}

#define VSRAH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).s[i] = (vA).s[i] >> sh; \
    } \
}

#define VSRAW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).l[i] = (vA).l[i] >> sh; \
    } \
}

#define VSRB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] >> sh; \
    } \
}

#define VSRH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] >> sh; \
    } \
}

#define VSRO( vT, vA, vB ) \
{ \
    long i, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 15; i >= sh; i-- ) \
        (vT).uc[i] = (vA).uc[i-sh]; \
    for ( j = i; j >= 0; j-- ) \
        (vT).uc[j] = 0; \
}

#define VSRW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] >> sh; \
    } \
}

#define VSPLTB( vT, vB, UIMM ) \
{ \
    uchar c; \
    ulong i; \
    c = (vB).uc[C_INDEX_MUNGE( UIMM ) & 0xf]; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = c; \
}

#define VSPLTH( vT, vB, UIMM ) \
{ \
    ushort s; \
    ulong i; \
    s = (vB).us[S_INDEX_MUNGE( UIMM ) & 0x7]; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = s; \
}

#define VSPLTW( vT, vB, UIMM ) \
{ \
    ulong i, l; \
    l = (vB).ul[L_INDEX_MUNGE( UIMM ) & 0x3]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = l; \
}

#define VSPLTISB( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = (char)(SIMM); \
}

#define VSPLTISH( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(SIMM); \
}

#define VSPLTISW( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(SIMM); \
}

#define VSUBFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a - b; \
        (vT).f[i] = c; \
    } \
}

#define VSUBSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] - (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}

#define VSUBSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] - (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}
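
/*
 * Illustrative sketch only (not part of the original header): the saturating
 * subtract macros above compute the difference in a wider type and clamp it
 * to the element range, while the unsigned variants clamp at zero. The
 * helper names sub_sat_s16 and sub_sat_u8 are hypothetical.
 */

```c
#include <stdint.h>

/* Signed 16-bit subtract that clamps to [-32768, 32767]. */
static int16_t sub_sat_s16(int16_t a, int16_t b)
{
    int32_t t = (int32_t)a - (int32_t)b;   /* cannot overflow in 32 bits */
    if (t < -32768) return -32768;
    if (t > 32767) return 32767;
    return (int16_t)t;
}

/* Unsigned 8-bit subtract that clamps at zero instead of wrapping. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a <= b) ? 0 : (uint8_t)(a - b);
}
```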

#define VSUBSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] - (vB).l[i]; \
        if ( ((vA).l[i] >= 0) && ((vB).l[i] < 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] > 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}



#define VSUBUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
}

#define VSUBUBS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) { \
        if ( (vA).uc[i] <= (vB).uc[i] ) (vT).uc[i] = 0; \
        else (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
    } \
}

#define VSUBUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] - (vB).us[i]; \
}

#define VSUBUHS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).us[i] <= (vB).us[i] ) (vT).us[i] = 0; \
        else (vT).us[i] = (vA).us[i] - (vB).us[i]; \
    } \
}

#define VSUBUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
}

#define VSUBUWS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).ul[i] <= (vB).ul[i] ) (vT).ul[i] = 0; \
        else (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
    } \
}

#define VSUMSWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum; \
    sum = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 4; i++ ) \
        sum += (double)(vA).l[i]; \
    /* lower bound written as a negative literal: 0x80000000 is unsigned */ \
    if ( sum > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum; \
}

#define VSUM2SWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum1, sum2; \
    sum1 = (double)(vB).l[L_INDEX_MUNGE( 1 )]; \
    sum2 = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 2; i++ ) { \
        sum1 += (double)(vA).l[L_INDEX_MUNGE( i )]; \
        sum2 += (double)(vA).l[L_INDEX_MUNGE( i+2 )]; \
    } \
    /* lower bound written as a negative literal: 0x80000000 is unsigned */ \
    if ( sum1 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x7fffffff; \
    else if ( sum1 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 1 )] = (long)sum1; \
    if ( sum2 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum2 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum2; \
}



#define VSUM4SBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).c[4*i + j]; \
            if ( sum > (double)(0x7fffffff) ) \
                (vT).l[i] = 0x7fffffff; \
            else if ( sum < -2147483648.0 ) \
                (vT).l[i] = 0x80000000; \
            else \
                (vT).l[i] = (long)sum; \
        } \
    } \
}



#define VSUM4SHS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            sum += (double)(vA).s[2*i + j]; \
            if ( sum > (double)(0x7fffffff) ) \
                (vT).l[i] = 0x7fffffff; \
            else if ( sum < -2147483648.0 ) \
                (vT).l[i] = 0x80000000; \
            else \
                (vT).l[i] = (long)sum; \
        } \
    } \
}



#define VSUM4UBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).uc[4*i + j]; \
            if ( sum > (2.0 * (double)(0x7fffffff) + 1.0) ) \
                (vT).ul[i] = 0xffffffff; \
            else \
                (vT).ul[i] = (ulong)sum; \
        } \
    } \
}
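
/*
 * Illustrative sketch only (not part of the original header): the VSUM4
 * macros above add a group of narrow elements to a 32-bit accumulator and
 * saturate the result. The macros accumulate in a double; this hypothetical
 * helper sum4_s8_saturate swaps in a 64-bit integer accumulator, which is an
 * assumption of the sketch rather than the header's method. It checks the
 * bounds once after the loop, which gives the same final value because the
 * wide accumulator itself never saturates.
 */

```c
#include <stdint.h>

/* Add four signed bytes to a 32-bit accumulator, clamping to the int32 range. */
static int32_t sum4_s8_saturate(const int8_t bytes[4], int32_t acc)
{
    int64_t sum = acc;                 /* wide enough that no step can overflow */
    for (int j = 0; j < 4; j++)
        sum += bytes[j];
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```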

#define VUPKHSB_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 7; i >= 0; i-- ) \
        (vT).s[i] = (short)(vB).c[i]; \
}

#define VUPKHSH_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 3; i >= 0; i-- ) \
        (vT).l[i] = (long)(vB).s[i]; \
}

#define VUPKLSB_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(vB).c[i+8]; \
}

#define VUPKLSH_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(vB).s[i+4]; \
}
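
/*
 * Illustrative sketch only (not part of the original header): the unpack
 * macros above sign-extend narrow elements into elements twice as wide.
 * VUPKHSB_BE walks its index downward, presumably so the macro still works
 * when source and destination are the same register; that reading is an
 * assumption. The union vec16 and the helper unpack_high_s8_inplace are
 * hypothetical stand-ins for the header's VMX_reg type.
 */

```c
#include <stdint.h>

/* 128-bit register viewed as 16 signed bytes or 8 signed halfwords. */
union vec16 {
    int8_t  c[16];
    int16_t s[8];
};

/* Sign-extend bytes 0..7 into halfwords 0..7 of the same register. */
static void unpack_high_s8_inplace(union vec16 *v)
{
    /* Descending order: storing s[i] overwrites bytes 2i and 2i+1, which
     * are only needed as sources by elements already processed. */
    for (int i = 7; i >= 0; i--)
        v->s[i] = (int16_t)v->c[i];
}
```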

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB )    VUPKLSB_BE( vT, vB );
#define VUPKHSH( vT, vB )    VUPKLSH_BE( vT, vB );
#define VUPKLSB( vT, vB )    VUPKHSB_BE( vT, vB );
#define VUPKLSH( vT, vB )    VUPKHSH_BE( vT, vB );
#else
#define VUPKHSB( vT, vB )    VUPKHSB_BE( vT, vB )
#define VUPKHSH( vT, vB )    VUPKHSH_BE( vT, vB )
#define VUPKLSB( vT, vB )    VUPKLSB_BE( vT, vB )
#define VUPKLSH( vT, vB )    VUPKLSH_BE( vT, vB )
#endif

#define VXOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] ^ (vB).ul[i]; \
}

#endif /* end BUILD_MAX */

#define VRSAVE_COND 7 /* recommended VR condition bit */

/*
 * macros to save and restore the CR register
 */
#define SAVE_CR
#define REST_CR

/*
 * macros to save and restore the LR register
 */
#define SAVE_LR
#define REST_LR

/*
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 *
 * For MAX only:
 *
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */
#define GET_GPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)gpr_save_area + 15) & ~15);

#define GET_FPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)fpr_save_area + 15) & ~15);

#if defined( BUILD_MAX )
#define GET_VR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)vr_save_area + 15) & ~15);
#endif

/*
 * macros to allocate and free space on the user stack.
 * For C implementation, the size is limited to 4096 bytes.
 */
#define PUSH_STACK( nbytes ) \
    sp = (long)(((ulong)stack + 15) & ~15);

#define POP_STACK( nbytes ) \
    sp = 0;

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    ptr = sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )

#define CREATE_STACK_FRAME( nbytes ) \
    PUSH_STACK( nbytes )

#define CREATE_STACK_FRAME_X( nbytes ) \
    CREATE_STACK_FRAME( nbytes )

#define DESTROY_STACK_FRAME \
    sp = 0;

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    ALLOCATE_STACK_SPACE( bufferp, nbytes )

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes ) \
    CREATE_STACK_BUFFER( bufferp, byte_align, nbytes )

#define DESTROY_STACK_BUFFER \
    sp = 0;
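
/*
 * Illustrative sketch only (not part of the original header): the save-area
 * and stack macros above round an address up to the next 16-byte boundary
 * with the expression (addr + 15) & ~15. The helper name align_up_16 is
 * hypothetical, and uintptr_t stands in for the header's `ulong` cast.
 */

```c
#include <stdint.h>

/* Round an address up to the next multiple of 16 (unchanged if already aligned). */
static uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15u) & ~(uintptr_t)15u;
}
```

/*
 * Adding 15 before masking is what makes the result round up rather than
 * down: 17 becomes 32, while 16 stays 16.
 */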
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/* 

 * macros to create salcache from the stack, used in ucode only
 */
#define CREATE_STACK_SALCACHE \
    char localcachebuffer[SALCACHE_ALLOC_SIZE];

#define DESTROY_STACK_SALCACHE

/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */



#define SAVE_f14
#define SAVE_f14_f15
#define SAVE_f14_f16
#define SAVE_f14_f17
#define SAVE_f14_f18
#define SAVE_f14_f19
#define SAVE_f14_f20
#define SAVE_f14_f21
#define SAVE_f14_f22
#define SAVE_f14_f23
#define SAVE_f14_f24
#define SAVE_f14_f25
#define SAVE_f14_f26
#define SAVE_f14_f27
#define SAVE_f14_f28
#define SAVE_f14_f29
#define SAVE_f14_f30
#define SAVE_f14_f31

#define SAVE_d14
#define SAVE_d14_d15
#define SAVE_d14_d16
#define SAVE_d14_d17
#define SAVE_d14_d18
#define SAVE_d14_d19
#define SAVE_d14_d20
#define SAVE_d14_d21
#define SAVE_d14_d22
#define SAVE_d14_d23
#define SAVE_d14_d24
#define SAVE_d14_d25
#define SAVE_d14_d26
#define SAVE_d14_d27
#define SAVE_d14_d28
#define SAVE_d14_d29
#define SAVE_d14_d30
#define SAVE_d14_d31



#define REST_f14
#define REST_f14_f15
#define REST_f14_f16
#define REST_f14_f17
#define REST_f14_f18
#define REST_f14_f19
#define REST_f14_f20
#define REST_f14_f21
#define REST_f14_f22
#define REST_f14_f23
#define REST_f14_f24
#define REST_f14_f25
#define REST_f14_f26
#define REST_f14_f27
#define REST_f14_f28
#define REST_f14_f29
#define REST_f14_f30
#define REST_f14_f31

#define REST_d14
#define REST_d14_d15
#define REST_d14_d16
#define REST_d14_d17
#define REST_d14_d18
#define REST_d14_d19
#define REST_d14_d20
#define REST_d14_d21
#define REST_d14_d22
#define REST_d14_d23
#define REST_d14_d24
#define REST_d14_d25
#define REST_d14_d26
#define REST_d14_d27
#define REST_d14_d28
#define REST_d14_d29
#define REST_d14_d30
#define REST_d14_d31


/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#define SAVE_r13
#define SAVE_r13_r14
#define SAVE_r13_r15
#define SAVE_r13_r16
#define SAVE_r13_r17
#define SAVE_r13_r18
#define SAVE_r13_r19
#define SAVE_r13_r20
#define SAVE_r13_r21
#define SAVE_r13_r22
#define SAVE_r13_r23
#define SAVE_r13_r24
#define SAVE_r13_r25
#define SAVE_r13_r26
#define SAVE_r13_r27
#define SAVE_r13_r28
#define SAVE_r13_r29
#define SAVE_r13_r30
#define SAVE_r13_r31

#define REST_r13
#define REST_r13_r14
#define REST_r13_r15
#define REST_r13_r16
#define REST_r13_r17
#define REST_r13_r18
#define REST_r13_r19
#define REST_r13_r20
#define REST_r13_r21
#define REST_r13_r22
#define REST_r13_r23
#define REST_r13_r24
#define REST_r13_r25
#define REST_r13_r26
#define REST_r13_r27
#define REST_r13_r28
#define REST_r13_r29
#define REST_r13_r30
#define REST_r13_r31


#define SAVE_r14
#define SAVE_r14_r15
#define SAVE_r14_r16
#define SAVE_r14_r17
#define SAVE_r14_r18
#define SAVE_r14_r19
#define SAVE_r14_r20
#define SAVE_r14_r21
#define SAVE_r14_r22
#define SAVE_r14_r23
#define SAVE_r14_r24
#define SAVE_r14_r25
#define SAVE_r14_r26
#define SAVE_r14_r27
#define SAVE_r14_r28
#define SAVE_r14_r29
#define SAVE_r14_r30
#define SAVE_r14_r31

#define REST_r14
#define REST_r14_r15
#define REST_r14_r16
#define REST_r14_r17
#define REST_r14_r18
#define REST_r14_r19
#define REST_r14_r20
#define REST_r14_r21
#define REST_r14_r22
#define REST_r14_r23
#define REST_r14_r24
#define REST_r14_r25
#define REST_r14_r26
#define REST_r14_r27
#define REST_r14_r28
#define REST_r14_r29
#define REST_r14_r30
#define REST_r14_r31

#define SAVE_r15
#define SAVE_r15_r16
#define SAVE_r15_r17
#define SAVE_r15_r18
#define SAVE_r15_r19
#define SAVE_r15_r20
#define SAVE_r15_r21
#define SAVE_r15_r22
#define SAVE_r15_r23
#define SAVE_r15_r24
#define SAVE_r15_r25
#define SAVE_r15_r26
#define SAVE_r15_r27
#define SAVE_r15_r28
#define SAVE_r15_r29
#define SAVE_r15_r30
#define SAVE_r15_r31

#define REST_r15
#define REST_r15_r16
#define REST_r15_r17
#define REST_r15_r18
#define REST_r15_r19
#define REST_r15_r20
#define REST_r15_r21
#define REST_r15_r22
#define REST_r15_r23
#define REST_r15_r24
#define REST_r15_r25
#define REST_r15_r26
#define REST_r15_r27
#define REST_r15_r28
#define REST_r15_r29
#define REST_r15_r30
#define REST_r15_r31



#define SAVE_r16
#define SAVE_r16_r17
#define SAVE_r16_r18
#define SAVE_r16_r19
#define SAVE_r16_r20
#define SAVE_r16_r21
#define SAVE_r16_r22
#define SAVE_r16_r23
#define SAVE_r16_r24
#define SAVE_r16_r25
#define SAVE_r16_r26
#define SAVE_r16_r27
#define SAVE_r16_r28
#define SAVE_r16_r29
#define SAVE_r16_r30
#define SAVE_r16_r31

#define REST_r16
#define REST_r16_r17
#define REST_r16_r18
#define REST_r16_r19
#define REST_r16_r20
#define REST_r16_r21
#define REST_r16_r22
#define REST_r16_r23
#define REST_r16_r24
#define REST_r16_r25
#define REST_r16_r26
#define REST_r16_r27
#define REST_r16_r28
#define REST_r16_r29
#define REST_r16_r30
#define REST_r16_r31



/*
 * VMX registers
 */
#define USE_THRU_v0( cond )
#define USE_THRU_v1( cond )
#define USE_THRU_v2( cond )
#define USE_THRU_v3( cond )
#define USE_THRU_v4( cond )
#define USE_THRU_v5( cond )
#define USE_THRU_v6( cond )
#define USE_THRU_v7( cond )
#define USE_THRU_v8( cond )
#define USE_THRU_v9( cond )
#define USE_THRU_v10( cond )
#define USE_THRU_v11( cond )
#define USE_THRU_v12( cond )
#define USE_THRU_v13( cond )
#define USE_THRU_v14( cond )
#define USE_THRU_v15( cond )
#define USE_THRU_v16( cond )
#define USE_THRU_v17( cond )
#define USE_THRU_v18( cond )
#define USE_THRU_v19( cond )
#define USE_THRU_v20( cond )
#define USE_THRU_v21( cond )
#define USE_THRU_v22( cond )
#define USE_THRU_v23( cond )
#define USE_THRU_v24( cond )
#define USE_THRU_v25( cond )
#define USE_THRU_v26( cond )
#define USE_THRU_v27( cond )
#define USE_THRU_v28( cond )
#define USE_THRU_v29( cond )
#define USE_THRU_v30( cond )
#define USE_THRU_v31( cond )

#define FREE_THRU_v0( cond )
#define FREE_THRU_v1( cond )
#define FREE_THRU_v2( cond )
#define FREE_THRU_v3( cond )
#define FREE_THRU_v4( cond )
#define FREE_THRU_v5( cond )
#define FREE_THRU_v6( cond )
#define FREE_THRU_v7( cond )
#define FREE_THRU_v8( cond )
#define FREE_THRU_v9( cond )
#define FREE_THRU_v10( cond )
#define FREE_THRU_v11( cond )
#define FREE_THRU_v12( cond )
#define FREE_THRU_v13( cond )
#define FREE_THRU_v14( cond )
#define FREE_THRU_v15( cond )
#define FREE_THRU_v16( cond )
#define FREE_THRU_v17( cond )
#define FREE_THRU_v18( cond )
#define FREE_THRU_v19( cond )
#define FREE_THRU_v20( cond )
#define FREE_THRU_v21( cond )
#define FREE_THRU_v22( cond )
#define FREE_THRU_v23( cond )
#define FREE_THRU_v24( cond )
#define FREE_THRU_v25( cond )
#define FREE_THRU_v26( cond )
#define FREE_THRU_v27( cond )
#define FREE_THRU_v28( cond )
#define FREE_THRU_v29( cond )
#define FREE_THRU_v30( cond )
#define FREE_THRU_v31( cond )

#endif


/* end SALPPC_H */ 



END OF FILE salppc.h 
#if !defined( SALPPC_INC )
#define SALPPC_INC



#if 0 
**************************************************************************
*
*   MC Standard Algorithms -- PPC Version
*
**************************************************************************

File Name:    salppc.inc
Description:  SAL macro include file

Source files should have extension .mac (for example, vadd.mac)
and must include this file (salppc.inc).

To assemble for PPC ucode, use the following basic
makefile build rule:

        .SUFFIXES: .mac .c .s .o

        .mac.o:
                cp $*.mac $*.c
                ccmc -o $*.s -E $*.c
                ccmc -c -o $*.o $*.s
                rm -f $*.s
                rm -f $*.c

To compile for C, use the following basic makefile build rule:

        .SUFFIXES: .mac .c .o

        .mac.o:
                cp $*.mac $*.c
                ccmc -DCOMPILE_C -o $*.o
                rm -f $*.c



The first 8 function arguments are passed in GPR registers
r3 - r10. Arguments beyond 8 are passed on the stack and may
be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
Additional GPR registers should be assigned in ascending order
starting from the last function argument. These may be declared
with the DECLARE_rx[_ry] macros. For example, a function with
5 arguments that requires 3 additional GPR registers would
issue: DECLARE_r8_r10. r0, if required, should be declared
separately with the DECLARE_r0 macro. GPR registers above r12
must be saved and restored using the SAVE_r13[_ry] and
REST_r13[_ry] macros, respectively.

FPR registers should be assigned in ascending order starting
with f0[d0]. These may be declared with the DECLARE_f0[_fy]
or DECLARE_d0[_dy] macros.

For example, DECLARE_f0_f11. FPR registers above f13[d13] must
be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.

All variables must be assigned a register using the
pre-processor #define directive. GPR registers are named
r0 - r31; single precision FPR registers are named f0 - f31.
Double precision FPR registers are named d0 - d31. Different
variables may be assigned to the same register as in:

        #define vara    f12
        #define varb    f12

Functions must begin with the FUNC_PROLOG macro and end
with the FUNC_EPILOG macro.
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* 
Macros are provided for both Fortran and C entry points.

The GET_SALCACHE macro should be used to get the address of
the "current" salcache buffer into a GPR register.

Avoid terminating macro lines with a semicolon.

The following example demonstrates typical usage:

#include "salppc.inc"

/*
 * assign variables to registers
 */
#define A       r3
#define I       r4
#define B       r5
#define J       r6
#define C       r7
#define K       r8
#define D       r9
#define L       r10
#define N       r12
#define EFLAG   r11
#define count   r11
#define t0      r13
#define t1      r13
#define t2      r14
#define t3      r14
#define t4      r15
#define t5      r15
#define t6      r16
#define a0      f0
#define a1      f1
#define a2      f2
#define a3      f3
#define b0      f4
#define b1      f5
#define b2      f6
#define b3      f7
#define c0      f8
#define c1      f9
#define c2      f10
#define c3      f11
#define d0      f12
#define d1      f13
#define d2      f14
#define d3      f15


FUNC_PROLOG                     /* must precede function */

#if !defined( COMPILE_C )
U_ENTRY(foo_)
        FORTRAN_DREF_4(I, J, K, L)
        FORTRAN_DREF_ARG8
U_ENTRY(foo)
        LI(EFLAG, 0)
        BR(common)
U_ENTRY(foo_x_)
        FORTRAN_DREF_4(I, J, L)
        FORTRAN_DREF_ARG8
        FORTRAN_DREF_ARG9
#endif

ENTRY_10(foo_x, A, I, B, J, C, D, N, EFLAG)
        DECLARE_r13_r16
        DECLARE_f0_f15
        GET_ARG9( EFLAG )       /* get the 9'th arg (EFLAG) off stack */

LABEL(common)
        SAVE_CR                 /* needed if using fields 2,3 or 4 */
        SAVE_r13_r16
        SAVE_f14_f15
        SAVE_LR                 /* needed if making a function call */
        GET_ARG8( N )           /* get the 8'th arg (N) off stack */

        /*
         * body of function
         */

        REST_CR
        REST_r13_r16
        REST_f14_f15
        REST_LR
        RETURN

FUNC_EPILOG                     /* must conclude function */



Mercury Computer Systems, Inc. 
Copyright (c) 1996 All rights reserved 



Revision   Date     Engineer; Reason
--------   ------   ----------------
0.0        960223   icr; Created
0.1        970109   jfk; Added POSTING_BUFFER_COUNT and made
                    TEST_IF_DCBZ macro time "stw" instead
                    of doing the TEST_IF_DCBT macro (lwz)
0.2        970124   jfk; Added SALCACHE_ALLOC_SIZE,
                    ALIGN_SALCACHE, CREATE_SALCACHE_FRAME,
                    DESTROY_SALCACHE_FRAME
0.3        970521   jfk; Added SET_DCB[TZ]_COND macros.
0.4        980813   Made old macros not assemble

■■ /* header */ 

#endif 

#endif 



/*
 * define single precision floating point field sizes,
 * limits, and values
 */
#define F_FLOAT_SIZE    32
#define F_FRAC_SIZE     23
#define F_HIDDEN_SIZE   1
#define F_EXP_SIZE      8
#define F_SIGN_SIZE     1
#define F_SIGN_BIT      (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK      ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS      ((1 << (F_EXP_SIZE-1)) - 1)
#define F_MAX_EXP       F_EXP_BIAS
#define F_MIN_EXP       (-(F_EXP_BIAS-1))

/*
 * define double precision floating point field sizes,
 * limits, and values
 */
#define D_FLOAT_SIZE    64
#define D_FRAC_SIZE     52
#define D_HIDDEN_SIZE   1
#define D_EXP_SIZE      11
#define D_SIGN_SIZE     1
#define D_SIGN_BIT      (D_FLOAT_SIZE - D_SIGN_SIZE)
#define D_EXP_MASK      ((1 << D_EXP_SIZE) - 1)
#define D_EXP_BIAS      ((1 << (D_EXP_SIZE-1)) - 1)
#define D_MAX_EXP       D_EXP_BIAS
#define D_MIN_EXP       (-(D_EXP_BIAS-1))
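
/*
 * Illustrative sketch only (not part of the original header): the field
 * size constants above describe the IEEE 754 layout, so the exponent mask
 * and bias follow from the exponent width (8 bits gives a mask of 255 and a
 * bias of 127). The helper float_exponent is hypothetical and shows how the
 * single precision constants could be used to read the unbiased exponent of
 * a normalized float; the macro definitions are repeated so the block
 * stands alone.
 */

```c
#include <stdint.h>
#include <string.h>

#define F_FRAC_SIZE     23
#define F_EXP_SIZE      8
#define F_EXP_MASK      ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS      ((1 << (F_EXP_SIZE-1)) - 1)

/* Return the unbiased binary exponent of a normalized single precision value. */
static int float_exponent(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* well-defined way to read the bit pattern */
    return (int)((bits >> F_FRAC_SIZE) & F_EXP_MASK) - F_EXP_BIAS;
}
```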



#if defined( BUILD_603 )
#define LOG2_CACHE_SIZE (14) /* Log (base 2) of 603 data cache */
#elif defined( BUILD_750 ) || defined( BUILD_MAX )
#define LOG2_CACHE_SIZE (15) /* Log (base 2) of 750 or MAX data cache */
#endif



#define LOG2_CACHE_BSIZE        (LOG2_CACHE_SIZE)
#define LOG2_CACHE_HSIZE        (LOG2_CACHE_SIZE - 1)
#define LOG2_CACHE_LSIZE        (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_FSIZE        (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_DSIZE        (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_CSIZE        (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_ZSIZE        (LOG2_CACHE_SIZE - 4)

#define CACHE_SIZE              (1 << LOG2_CACHE_SIZE)
#define CACHE_BSIZE             (CACHE_SIZE)
#define CACHE_HSIZE             (CACHE_SIZE >> 1)
#define CACHE_LSIZE             (CACHE_SIZE >> 2)
#define CACHE_FSIZE             (CACHE_SIZE >> 2)
#define CACHE_DSIZE             (CACHE_SIZE >> 3)
#define CACHE_CSIZE             (CACHE_SIZE >> 3)
#define CACHE_ZSIZE             (CACHE_SIZE >> 4)

#define LOG2_CACHE_LINE_SIZE    5
#define CACHE_LINE_SIZE         (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_LSIZE        (CACHE_LINE_SIZE >> 2)
#define CACHE_LINE_MASK         (CACHE_LINE_SIZE - 1)
#define CACHE_LINE_ADDR_MASK    (0xffffffe0)

#define LOG2_SALCACHE_ALIGN     6
#define SALCACHE_ALIGN          (1 << LOG2_SALCACHE_ALIGN)
#define SALCACHE_ALIGN_MASK     (SALCACHE_ALIGN - 1)

#define SALCACHE_SIZE           CACHE_SIZE
#define SALCACHE_EXTRA_SIZE     (SALCACHE_ALIGN + 64)
#define SALCACHE_ALLOC_SIZE     (SALCACHE_SIZE + SALCACHE_EXTRA_SIZE)
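
/*
 * Illustrative sketch only (not part of the original header): with a 32-byte
 * cache line, CACHE_LINE_MASK isolates the offset within a line and
 * CACHE_LINE_ADDR_MASK (0xffffffe0) isolates the line's base address. The
 * helpers cache_line_base and cache_lines_touched are hypothetical, and
 * 32-bit addresses are assumed to match the mask width above; the macro
 * definitions are repeated so the block stands alone.
 */

```c
#include <stdint.h>

#define LOG2_CACHE_LINE_SIZE    5
#define CACHE_LINE_SIZE         (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_MASK         (CACHE_LINE_SIZE - 1)

/* Base address of the cache line containing addr. */
static uint32_t cache_line_base(uint32_t addr)
{
    return addr & ~(uint32_t)CACHE_LINE_MASK;
}

/* Number of cache lines spanned by nbytes starting at addr. */
static uint32_t cache_lines_touched(uint32_t addr, uint32_t nbytes)
{
    if (nbytes == 0)
        return 0;
    uint32_t first = addr >> LOG2_CACHE_LINE_SIZE;
    uint32_t last = (addr + nbytes - 1) >> LOG2_CACHE_LINE_SIZE;
    return last - first + 1;
}
```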



/*
 * Define memory vector non-cache (N) / cache (C) FLAG values for
 * Enhanced SAL calls (final argument). The letters in the symbol
 * correspond to the vectors in the call, moving from left to right
 * so, for example:
 *
 * for VMULX, there are the following 8 possibilities:
 *
 * VMULX (A, I, B, J, C, K, N, SAL_NNN)    A, B, C all not in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NNC)    A, B not in cache, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCN)    A, C not in cache, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCC)    A not in cache, B, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNN)    B, C not in cache, A in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNC)    B not in cache, A, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCN)    C not in cache, A, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCC)    A, B, C all in cache
 */

/*
 * 1 vector algorithms
 */

#define SAL_N   0
#define SAL_C   1

/*
 * 2 vector algorithms
 */

#define SAL_NN  0
#define SAL_NC  1
#define SAL_CN  2
#define SAL_CC  3

/*
 * 3 vector algorithms
 */

#define SAL_NNN 0
#define SAL_NNC 1
#define SAL_NCN 2
#define SAL_NCC 3
#define SAL_CNN 4
#define SAL_CNC 5
#define SAL_CCN 6
#define SAL_CCC 7


/*
 * 4 vector algorithms
 */

#define SAL_NNNN        0
#define SAL_NNNC        1
#define SAL_NNCN        2
#define SAL_NNCC        3
#define SAL_NCNN        4
#define SAL_NCNC        5
#define SAL_NCCN        6
#define SAL_NCCC        7
#define SAL_CNNN        8
#define SAL_CNNC        9
#define SAL_CNCN        10
#define SAL_CNCC        11
#define SAL_CCNN        12
#define SAL_CCNC        13
#define SAL_CCCN        14
#define SAL_CCCC        15
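The flag values follow a simple binary pattern: reading the letters left to right, each C contributes a 1 bit and each N a 0 bit, with the leftmost vector in the most significant position. The sketch below shows that encoding; the helper names `sal_flag_from_letters` and `sal_vector_in_cache` are hypothetical and are not part of the listing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: build a flag value from a string such as "NCN".
 * Returns -1 if the string contains anything other than 'N' or 'C'.
 */
static int sal_flag_from_letters(const char *letters)
{
    int flag = 0;
    size_t i;

    for (i = 0; letters[i] != '\0'; ++i) {
        if (letters[i] != 'N' && letters[i] != 'C')
            return -1;
        /* shift earlier vectors up, append this vector's bit */
        flag = (flag << 1) | (letters[i] == 'C');
    }
    return flag;
}

/*
 * Hypothetical helper: report whether the vector at position pos
 * (0 = leftmost) of an nvec-vector call is flagged as cached.
 */
static int sal_vector_in_cache(int flag, int nvec, int pos)
{
    return (flag >> (nvec - 1 - pos)) & 1;
}
```

For example, "NCN" encodes to 2 and "CNCC" to 11, matching the SAL_NCN and SAL_CNCC values listed above.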


/*
 * 5 vector algorithms
 */

#define SAL_NNNNN       0
#define SAL_NNNNC       1
#define SAL_NNNCN       2
#define SAL_NNNCC       3
#define SAL_NNCNN       4
#define SAL_NNCNC       5
#define SAL_NNCCN       6
#define SAL_NNCCC       7
#define SAL_NCNNN       8
#define SAL_NCNNC       9
#define SAL_NCNCN       10
#define SAL_NCNCC       11
#define SAL_NCCNN       12
#define SAL_NCCNC       13
#define SAL_NCCCN       14
#define SAL_NCCCC       15
#define SAL_CNNNN       16
#define SAL_CNNNC       17
#define SAL_CNNCN       18
#define SAL_CNNCC       19
#define SAL_CNCNN       20
#define SAL_CNCNC       21
#define SAL_CNCCN       22
#define SAL_CNCCC       23
#define SAL_CCNNN       24
#define SAL_CCNNC       25
#define SAL_CCNCN       26
#define SAL_CCNCC       27
#define SAL_CCCNN       28
#define SAL_CCCNC       29
#define SAL_CCCCN       30
#define SAL_CCCCC       31

/*
 * define byte offsets into FFT_setup_ppc603e
 */



#define FFT_SETUP_HANDLE                0
#define FFT_SETUP_SMALL_TWIDP           4
#define FFT_SETUP_SMALL_BITR_TWIDP      8
#define FFT_SETUP_SMALL_LOG2M           12
#define FFT_SETUP_BIG_TWIDP             16
#define FFT_SETUP_BIG_XY_TWIDP          20
#define FFT_SETUP_BIG_LOG2MXY           24
#define FFT_SETUP_BIG_LOG2X             28
#define FFT_SETUP_BIG_LOG2Y             32
#define FFT_SETUP_BIG_STRIPX            36
#define FFT_SETUP_RPASS_TWIDP           40
#define FFT_SETUP_RADIX3_TWIDP          44
#define FFT_SETUP_RADIX5_TWIDP          48
#define FFT_SETUP_LOG2M                 52
#define FFT_SETUP_LOG2MR                56
#define FFT_SETUP_VMX_BITR_TWIDP        60
#define FFT_SETUP_VMX_TABLES            64


/*
 * ASIC equates
 */

#define ASIC_H                  -1024           /* (0xFBFF + 1) */

#define PREFETCH_CONTROL        (0xFBFFFE00)
#define PREFETCH_CONTROL_H      -1024           /* (0xFBFF + 1) */
#define PREFETCH_CONTROL_L      -512            /* (0xFE00) */

#define MISCON_B                (0xFBFFFC18)
#define MISCON_B_H              -1024           /* (0xFBFF + 1) */
#define MISCON_B_L              -1000           /* (0xFC18) */

#define PREFETCH_DISABLED       0
#define PREFETCH_AUTO_6         1
#define PREFETCH_AUTO_5         2
#define PREFETCH_AUTO_4         3
#define PREFETCH_AUTO_3         4
#define PREFETCH_AUTO_2         5
#define PREFETCH_AUTO_1         6
#define PREFETCH_AUTO_0         7

#define PREFETCH_MANUAL_0       8
#define PREFETCH_MANUAL_2       9
#define PREFETCH_MANUAL_4       10
#define PREFETCH_MANUAL_6       11
#define PREFETCH_MANUAL_8       12
#define PREFETCH_MANUAL_10      13
#define PREFETCH_MANUAL_12      14
#define PREFETCH_MANUAL_14      15

#define USE_PREFETCH_CONTROL    16
#define USE_MISCON_B            0



3/9/2001 



#define PREFETCH_MASK           15

#define PREFETCH_DEFAULT        (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_OFF            (USE_PREFETCH_CONTROL | PREFETCH_DISABLED)
#define PREFETCH_A6             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_6)
#define PREFETCH_A5             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_5)
#define PREFETCH_A4             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_4)
#define PREFETCH_A3             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_3)
#define PREFETCH_A2             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_2)
#define PREFETCH_A1             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_1)
#define PREFETCH_A0             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_0)

#define PREFETCH_M0             (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_M2             (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_2)
#define PREFETCH_M4             (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_4)
#define PREFETCH_M6             (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_6)
#define PREFETCH_M8             (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_8)
#define PREFETCH_M10            (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_10)
#define PREFETCH_M12            (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_12)
#define PREFETCH_M14            (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_14)
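Each prefetch setting pairs a register-select value (USE_PREFETCH_CONTROL, 16, or USE_MISCON_B, 0) with a mode in the range 0 to 15. The operator joining the two is not legible in the listing; the sketch below assumes a bitwise OR, which gives the same result as addition here because the mode occupies only the low four bits covered by PREFETCH_MASK. The helper names `prefetch_mode` and `prefetch_uses_control_reg` are hypothetical.

```c
#include <assert.h>

#define PREFETCH_DISABLED       0
#define PREFETCH_AUTO_6         1
#define PREFETCH_MANUAL_0       8
#define PREFETCH_MANUAL_14      15

#define USE_PREFETCH_CONTROL    16
#define USE_MISCON_B            0

#define PREFETCH_MASK           15

/* Assumed combining operator: bitwise OR (not legible in the listing). */
#define PREFETCH_DEFAULT        (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_A6             (USE_PREFETCH_CONTROL | PREFETCH_AUTO_6)

/* Hypothetical helper: recover the 4-bit prefetch mode from a setting. */
static int prefetch_mode(int setting)
{
    return setting & PREFETCH_MASK;
}

/* Hypothetical helper: nonzero if the setting selects PREFETCH_CONTROL. */
static int prefetch_uses_control_reg(int setting)
{
    return (setting & USE_PREFETCH_CONTROL) != 0;
}
```

Keeping the register select outside the masked bits lets one integer carry both pieces of information.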



/*
 * macro to compile for PPC assembly (COMPILE_C *not* defined) or
 * C code (COMPILE_C defined)
 */

#if defined( COMPILE_C )
#include "salppc.h"
#else

/*
 * GPR register equates
 */

#define r0      0
#define sp      1
#define rtoc    2
#define r3      3
#define r4      4
#define r5      5
#define r6      6
#define r7      7
#define r8      8
#define r9      9
#define r10     10
#define r11     11
#define r12     12
#define r13     13
#define r14     14
#define r15     15
#define r16     16
#define r17     17
#define r18     18
#define r19     19
#define r20     20
#define r21     21
#define r22     22
#define r23     23
#define r24     24
#define r25     25
#define r26     26
#define r27     27
#define r28     28
#define r29     29
#define r30     30
#define r31     31


/*
 * FPR single precision register equates
 */

#define f0      0
#define f1      1
#define f2      2
#define f3      3
#define f4      4
#define f5      5
#define f6      6
#define f7      7
#define f8      8
#define f9      9
#define f10     10
#define f11     11
#define f12     12
#define f13     13
#define f14     14
#define f15     15
#define f16     16
#define f17     17
#define f18     18
#define f19     19
#define f20     20
#define f21     21
#define f22     22
#define f23     23
#define f24     24
#define f25     25
#define f26     26
#define f27     27
#define f28     28
#define f29     29
#define f30     30
#define f31     31


/*
 * FPR double precision register equates
 */

#define d0      0
#define d1      1
#define d2      2
#define d3      3
#define d4      4
#define d5      5
#define d6      6
#define d7      7
#define d8      8
#define d9      9
#define d10     10
#define d11     11
#define d12     12
#define d13     13
#define d14     14
#define d15     15
#define d16     16
#define d17     17
#define d18     18
#define d19     19
#define d20     20
#define d21     21
#define d22     22
#define d23     23
#define d24     24
#define d25     25
#define d26     26
#define d27     27
#define d28     28
#define d29     29
#define d30     30
#define d31     31



#if defined( BUILD_MAX )

/*
 * VMX (g4) register equates
 */

#define v0      0
#define v1      1
#define v2      2
#define v3      3
#define v4      4
#define v5      5
#define v6      6
#define v7      7
#define v8      8
#define v9      9
#define v10     10
#define v11     11
#define v12     12
#define v13     13
#define v14     14
#define v15     15
#define v16     16
#define v17     17
#define v18     18
#define v19     19
#define v20     20
#define v21     21
#define v22     22
#define v23     23
#define v24     24
#define v25     25
#define v26     26
#define v27     27
#define v28     28
#define v29     29
#define v30     30
#define v31     31

#endif

#define FUNC_PROLOG \
    .section .text; \
    .align 5;

#define FUNC_EPILOG

#define TEXT_SECTION( logb2_align ) \
    .section .text; \
    .align logb2_align;

#define DATA_SECTION( logb2_align ) \
    .section .data; \
    .align logb2_align;

#define RODATA_SECTION( logb2_align ) \
    .section .rodata; \
    .align logb2_align;

#define PC_OFFSET( nbytes ) (. + (nbytes))

/*
 * make a "double" concat to fool the preprocessor so that input
 * arguments get translated before concatenation; otherwise, the
 * concatenated symbol doesn't get translated properly
 */

#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right
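The two-level concatenation works because the C preprocessor does not macro-expand an argument that is a direct operand of `##`, but does expand it when it is merely forwarded to another macro. The sketch below contrasts the two behaviours; `CONCAT_DIRECT`, `STR`, `STR_NEST`, `ROOT`, `direct_name` and `nested_name` are hypothetical names added only for the demonstration.

```c
#include <assert.h>
#include <string.h>

/* Single-level paste: arguments are pasted before they are expanded. */
#define CONCAT_DIRECT( left, right ) left##right

/* Two-level paste, as in the listing: arguments are expanded first. */
#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right

/* Hypothetical stringizing helpers so the result can be inspected. */
#define STR_NEST( x ) #x
#define STR( x ) STR_NEST( x )

/* Hypothetical root name, itself a macro. */
#define ROOT vaddx

/* Yields "ROOT_jump": ROOT was pasted before it could be expanded. */
static const char *direct_name(void)
{
    return STR( CONCAT_DIRECT( ROOT, _jump ) );
}

/* Yields "vaddx_jump": ROOT was expanded, then pasted. */
static const char *nested_name(void)
{
    return STR( CONCAT( ROOT, _jump ) );
}
```

This matters whenever the left or right operand is itself a macro, such as a register equate or a root function name.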

/*
 * macro for extern declarations and definitions
 */

#define EXTERN_DATA( symbol )
#define EXTERN_FUNC( func )

/*
 * macro for a global declaration
 */

#define GLOBAL( symbol ) \
    .globl symbol

/*
 * macro for a local declaration
 */

#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */

#define START_ARRAY( name ) \
    name##:

#define START_C_ARRAY( name )   START_ARRAY( name )
#define START_UC_ARRAY( name )  START_ARRAY( name )
#define START_S_ARRAY( name )   START_ARRAY( name )
#define START_US_ARRAY( name )  START_ARRAY( name )
#define START_L_ARRAY( name )   START_ARRAY( name )
#define START_UL_ARRAY( name )  START_ARRAY( name )
#define START_F_ARRAY( name )   START_ARRAY( name )

#define END_ARRAY

#define DATA( type, d1 ) \
    .##type d1

#define DATA2( type, d1, d2 ) \
    .##type d1, d2

#define DATA4( type, d1, d2, d3, d4 ) \
    .##type d1, d2, d3, d4

#define DATA8( type, d1, d2, d3, d4, d5, d6, d7, d8 ) \
    .##type d1, d2, d3, d4, d5, d6, d7, d8

#define C_DATA( d1 )    DATA( byte, d1 )
#define UC_DATA( d1 )   DATA( byte, d1 )
#define S_DATA( d1 )    DATA( short, d1 )
#define US_DATA( d1 )   DATA( short, d1 )
#define L_DATA( d1 )    DATA( long, d1 )
#define UL_DATA( d1 )   DATA( long, d1 )
#define F_DATA( d1 )    DATA( float, d1 )

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 )        DATA2( long, d2, d1 )
#else
#define D_DATA( d1, d2 )        DATA2( long, d1, d2 )
#endif

#define C_DATA2( d1, d2 )       DATA2( byte, d1, d2 )
#define UC_DATA2( d1, d2 )      DATA2( byte, d1, d2 )
#define S_DATA2( d1, d2 )       DATA2( short, d1, d2 )
#define US_DATA2( d1, d2 )      DATA2( short, d1, d2 )
#define L_DATA2( d1, d2 )       DATA2( long, d1, d2 )
#define UL_DATA2( d1, d2 )      DATA2( long, d1, d2 )
#define F_DATA2( d1, d2 )       DATA2( float, d1, d2 )

#define C_DATA4( d1, d2, d3, d4 )       DATA4( byte, d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 )      DATA4( byte, d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 )       DATA4( short, d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 )      DATA4( short, d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 )       DATA4( long, d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 )      DATA4( long, d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 )       DATA4( float, d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( float, d1, d2, d3, d4, d5, d6, d7, d8 )



/*
 * macros for creating vmx permute masks (128-bits)
 */

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l )    ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s )    ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c )    ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x )      ( (x) - 0x3 )
#define S_INDEX_MUNGE( x )      ( (x) - 0x7 )
#define C_INDEX_MUNGE( x )      ( (x) - 0xf )

#else

#define L_PERMUTE_MUNGE( l )    ( l )
#define S_PERMUTE_MUNGE( s )    ( s )
#define C_PERMUTE_MUNGE( c )    ( c )
#define L_INDEX_MUNGE( x )      ( x )
#define S_INDEX_MUNGE( x )      ( x )
#define C_INDEX_MUNGE( x )      ( x )

#endif

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    .long L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
          L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 )

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    .short S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
           S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
           S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
           S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 )

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    .byte C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
          C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
          C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
          C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
          C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
          C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
          C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
          C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 )

/*
 * macro for a microcode entry point (e.g. vaddx, vaddx_)
 * U_ENTRY is a "nop" for C code
 */

#define U_ENTRY( func_name ) \
    .globl func_name; \
func_name:



/*
 * macros for C function prototypes
 */

#define PROTOTYPE_0( func_name )
#define PROTOTYPE_1( func_name )
#define PROTOTYPE_2( func_name )
#define PROTOTYPE_3( func_name )
#define PROTOTYPE_4( func_name )
#define PROTOTYPE_5( func_name )
#define PROTOTYPE_6( func_name )
#define PROTOTYPE_7( func_name )
#define PROTOTYPE_8( func_name )
#define PROTOTYPE_9( func_name )
#define PROTOTYPE_10( func_name )
#define PROTOTYPE_11( func_name )
#define PROTOTYPE_12( func_name )
#define PROTOTYPE_13( func_name )
#define PROTOTYPE_14( func_name )
#define PROTOTYPE_15( func_name )
#define PROTOTYPE_16( func_name )

/*
 * macros for C and Fortran callable entry points
 */

#define ENTRY_0( func_name ) \
    .globl func_name; \
func_name:

#define ENTRY_1( func_name, arg0 ) \
    .globl func_name; \
func_name:

#define ENTRY_2( func_name, arg0, arg1 ) \
    .globl func_name; \
func_name:

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
    .globl func_name; \
func_name:

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
    .globl func_name; \
func_name:

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    .globl func_name; \
func_name:

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    .globl func_name; \
func_name:

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
    .globl func_name; \
func_name:

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
    .globl func_name; \
func_name:

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
    .globl func_name; \
func_name:

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
    .globl func_name; \
func_name:

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
    .globl func_name; \
func_name:

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
    .globl func_name; \
func_name:

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
    .globl func_name; \
func_name:

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
    .globl func_name; \
func_name:

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
    .globl func_name; \
func_name:

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
    .globl func_name; \
func_name:

/*
 * macros to de-reference any set of the first 8 arguments
 * passed by reference to the Fortran entry point but by
 * value to the corresponding C entry point
 */

#define FORTRAN_DREF_1( arg0 ) \
    lwz arg0, 0(arg0);

#define FORTRAN_DREF_2( arg0, arg1 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1);

#define FORTRAN_DREF_3( arg0, arg1, arg2 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2);

#define FORTRAN_DREF_4( arg0, arg1, arg2, arg3 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3);

#define FORTRAN_DREF_5( arg0, arg1, arg2, arg3, arg4 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4);

#define FORTRAN_DREF_6( arg0, arg1, arg2, arg3, arg4, arg5 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5);

#define FORTRAN_DREF_7( arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6);

#define FORTRAN_DREF_8( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6); \
    lwz arg7, 0(arg7);
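Each `lwz argN, 0(argN)` replaces a register holding an address with the word stored at that address, which converts Fortran-style by-reference arguments into the by-value form the C entry point expects. A rough C analogue is sketched below; the function names `scale_add` and `scale_add_` are hypothetical, and the trailing underscore for the Fortran-callable name is an assumed convention rather than something the listing states.

```c
#include <assert.h>

/* Hypothetical C entry point: arguments arrive by value. */
static int scale_add(int a, int b, int n)
{
    return a * n + b;
}

/*
 * Hypothetical Fortran-callable entry point: every argument arrives
 * as an address. Dereferencing each pointer before the call plays the
 * role of a three-argument de-reference macro followed by a branch to
 * the C entry point.
 */
static int scale_add_(const int *a, const int *b, const int *n)
{
    return scale_add(*a, *b, *n);
}
```

In the assembly version the loads happen in place in the argument registers, so the Fortran entry can fall through or branch to the C entry with no further shuffling.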

/*
 * macros to de-reference specific arguments beyond the first 8
 * passed by value to the C entry point
 */

#define ARG_OFF (8 - 8*4)

#define FORTRAN_DREF_ARG8 \
    lwz r12, (ARG_OFF + 8*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 8*4)(sp);

#define FORTRAN_DREF_ARG9 \
    lwz r12, (ARG_OFF + 9*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 9*4)(sp);

#define FORTRAN_DREF_ARG10 \
    lwz r12, (ARG_OFF + 10*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 10*4)(sp);

#define FORTRAN_DREF_ARG11 \
    lwz r12, (ARG_OFF + 11*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 11*4)(sp);

#define FORTRAN_DREF_ARG12 \
    lwz r12, (ARG_OFF + 12*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 12*4)(sp);

#define FORTRAN_DREF_ARG13 \
    lwz r12, (ARG_OFF + 13*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 13*4)(sp);

#define FORTRAN_DREF_ARG14 \
    lwz r12, (ARG_OFF + 14*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 14*4)(sp);

#define FORTRAN_DREF_ARG15 \
    lwz r12, (ARG_OFF + 15*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 15*4)(sp);

#define FORTRAN_DREF_ARG16 \
    lwz r12, (ARG_OFF + 16*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 16*4)(sp);

#define FORTRAN_DREF_ARG17 \
    lwz r12, (ARG_OFF + 17*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 17*4)(sp);
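The expression `ARG_OFF + k*4` turns an argument index k (8 and up) into a byte displacement from the stack pointer: with ARG_OFF equal to 8 - 8*4, argument 8 sits at displacement 8 and each later argument one 4-byte word further on. The sketch below just evaluates that arithmetic; `arg_stack_offset` is a hypothetical helper, and the observation that the first eight word arguments travel in registers reflects the usual 32-bit PowerPC calling conventions rather than anything stated in the listing.

```c
#include <assert.h>

#define ARG_OFF (8 - 8*4)

/*
 * Hypothetical helper: byte displacement from sp of word argument k,
 * for k >= 8 (arguments 0..7 are assumed to be passed in registers).
 */
static int arg_stack_offset(int k)
{
    return ARG_OFF + k * 4;
}
```

Writing the displacement as `ARG_OFF + k*4` keeps the argument index visible at each use, so only ARG_OFF would need to change if the stack frame layout did.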

/*
 * macros to get GPR arguments beyond 8
 */

#define GET_ARG8( rD )  lwz rD, (ARG_OFF + 8*4)(sp);
#define GET_ARG9( rD )  lwz rD, (ARG_OFF + 9*4)(sp);
#define GET_ARG10( rD ) lwz rD, (ARG_OFF + 10*4)(sp);
#define GET_ARG11( rD ) lwz rD, (ARG_OFF + 11*4)(sp);
#define GET_ARG12( rD ) lwz rD, (ARG_OFF + 12*4)(sp);
#define GET_ARG13( rD ) lwz rD, (ARG_OFF + 13*4)(sp);
#define GET_ARG14( rD ) lwz rD, (ARG_OFF + 14*4)(sp);
#define GET_ARG15( rD ) lwz rD, (ARG_OFF + 15*4)(sp);
#define GET_ARG16( rD ) lwz rD, (ARG_OFF + 16*4)(sp);
#define GET_ARG17( rD ) lwz rD, (ARG_OFF + 17*4)(sp);

/*
 * macros to set GPR arguments beyond 8
 */

#define SET_ARG8( rD )  stw rD, (ARG_OFF + 8*4)(sp);
#define SET_ARG9( rD )  stw rD, (ARG_OFF + 9*4)(sp);
#define SET_ARG10( rD ) stw rD, (ARG_OFF + 10*4)(sp);
#define SET_ARG11( rD ) stw rD, (ARG_OFF + 11*4)(sp);
#define SET_ARG12( rD ) stw rD, (ARG_OFF + 12*4)(sp);
#define SET_ARG13( rD ) stw rD, (ARG_OFF + 13*4)(sp);
#define SET_ARG14( rD ) stw rD, (ARG_OFF + 14*4)(sp);
#define SET_ARG15( rD ) stw rD, (ARG_OFF + 15*4)(sp);
#define SET_ARG16( rD ) stw rD, (ARG_OFF + 16*4)(sp);
#define SET_ARG17( rD ) stw rD, (ARG_OFF + 17*4)(sp);

/*
 * macro to branch from one entry point to another
 */

#define BR_FUNC( func_name ) \
    b func_name;

/*
 * macros to call functions
 */

#define CALL_FUNC( func_name ) \
    bl func_name;

#define CALL_0( func_name ) \
    CALL_FUNC( func_name )

#define CALL_1( func_name, arg0 ) \
    CALL_FUNC( func_name )

#define CALL_2( func_name, arg0, arg1 ) \
    CALL_FUNC( func_name )

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    CALL_FUNC( func_name )

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    CALL_FUNC( func_name )

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    CALL_FUNC( func_name )

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    CALL_FUNC( func_name )

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    CALL_FUNC( func_name )

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    CALL_FUNC( func_name )

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    CALL_FUNC( func_name )

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    CALL_FUNC( func_name )

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    CALL_FUNC( func_name )

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    CALL_FUNC( func_name )

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    CALL_FUNC( func_name )

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    CALL_FUNC( func_name )

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    CALL_FUNC( func_name )

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    CALL_FUNC( func_name )



#if defined( BUILD_MAX )

#if defined( COMPILE_ESAL_JUMP_TABLE )

/*
 * G4 macros to create an ESAL jump table for 1, 2, 3 and 4 vector
 * algorithms. The table name is <root_name>_jump and is made a
 * local symbol. (not supported in C)
 */

#define DECLARE_VMX_V1( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _n ); \
    .long CONCAT( root_name, _c );

#define DECLARE_VMX_V2( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nn ); \
    .long CONCAT( root_name, _nc ); \
    .long CONCAT( root_name, _cn ); \
    .long CONCAT( root_name, _cc );

#define DECLARE_VMX_V3( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnn ); \
    .long CONCAT( root_name, _nnc ); \
    .long CONCAT( root_name, _ncn ); \
    .long CONCAT( root_name, _ncc ); \
    .long CONCAT( root_name, _cnn ); \
    .long CONCAT( root_name, _cnc ); \
    .long CONCAT( root_name, _ccn ); \
    .long CONCAT( root_name, _ccc );

#define DECLARE_VMX_V4( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnn ); \
    .long CONCAT( root_name, _nnnc ); \
    .long CONCAT( root_name, _nncn ); \
    .long CONCAT( root_name, _nncc ); \
    .long CONCAT( root_name, _ncnn ); \
    .long CONCAT( root_name, _ncnc ); \
    .long CONCAT( root_name, _nccn ); \
    .long CONCAT( root_name, _nccc ); \
    .long CONCAT( root_name, _cnnn ); \
    .long CONCAT( root_name, _cnnc ); \
    .long CONCAT( root_name, _cncn ); \
    .long CONCAT( root_name, _cncc ); \
    .long CONCAT( root_name, _ccnn ); \
    .long CONCAT( root_name, _ccnc ); \
    .long CONCAT( root_name, _cccn ); \
    .long CONCAT( root_name, _cccc );




#define DECLARE_VMX_V5( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnnn ); \
    .long CONCAT( root_name, _nnnnc ); \
    .long CONCAT( root_name, _nnncn ); \
    .long CONCAT( root_name, _nnncc ); \
    .long CONCAT( root_name, _nncnn ); \
    .long CONCAT( root_name, _nncnc ); \
    .long CONCAT( root_name, _nnccn ); \
    .long CONCAT( root_name, _nnccc ); \
    .long CONCAT( root_name, _ncnnn ); \
    .long CONCAT( root_name, _ncnnc ); \
    .long CONCAT( root_name, _ncncn ); \
    .long CONCAT( root_name, _ncncc ); \
    .long CONCAT( root_name, _nccnn ); \
    .long CONCAT( root_name, _nccnc ); \
    .long CONCAT( root_name, _ncccn ); \
    .long CONCAT( root_name, _ncccc ); \
    .long CONCAT( root_name, _cnnnn ); \
    .long CONCAT( root_name, _cnnnc ); \
    .long CONCAT( root_name, _cnncn ); \
    .long CONCAT( root_name, _cnncc ); \
    .long CONCAT( root_name, _cncnn ); \
    .long CONCAT( root_name, _cncnc ); \
    .long CONCAT( root_name, _cnccn ); \
    .long CONCAT( root_name, _cnccc ); \
    .long CONCAT( root_name, _ccnnn ); \
    .long CONCAT( root_name, _ccnnc ); \
    .long CONCAT( root_name, _ccncn ); \
    .long CONCAT( root_name, _ccncc ); \
    .long CONCAT( root_name, _cccnn ); \
    .long CONCAT( root_name, _cccnc ); \
    .long CONCAT( root_name, _ccccn ); \
    .long CONCAT( root_name, _ccccc );

#define DECLARE_VMX_Z1( root_name )     DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_Z2( root_name )     DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_Z3( root_name )     DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_Z4( root_name )     DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_Z5( root_name )     DECLARE_VMX_V5( root_name )



G4 macros to branch through the <root name> jump table based on 
the value of the ESAL flag, (not supported in C) 
(uses rO as scratch and destroys eflag) 
(not supported in C) 



*/ 

#define BR ESAL_JUMP TABLE_COMMON ( root name, rtemp 
addis rtemp, 0, CONCAT ( root name, jump@ha ); \ 
addi rtemp, rtemp, CONCAT ( root_name, _jump®l ); 
Iwzx rtemp, rtemp, rO; \ 
mtctr rtemp; \ 
bctr; 

#define BR VMX VI ( root_name, eflag, rtemp ) \ 
rlwinm rO, eflag, 2, 29, 29; \ 
BR_ESAL_JUMP_TABLE_COMMON { root_name, rtemp ) 

#define BR VMX V2 ( root_name, eflag, rtemp ) \ 
rlwinm rO, eflag, 2, 28, 29; \ 

BR_ESAL_JUMP_TABLE_COMMON ( root__name, rtemp ) 

ttdefine BR VMX V3 ( root_name, eflag, rtemp ) \ 
rlwinm rO, eflag, 2, 27, 29; \ 

BR_ESAL_JUMP_TABLE_COMMON ( root_name, rtemp ) 



) \ 
\ 



#define BR_VMX_V4 ( root_name, eflag, rtemp ) \ 
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rlwinm rO, eflag, 2, 26, 29; \ 

BR_ESAL_JUMP_TABLE_COMMON ( root_name , rtemp ) 

#define BR VMX V5 ( root_name, eflag, rtetnp ) \ 
rlwinm rO, eflag, 2, 25, 29; \ 



BR_ESAL_ 


_JUMP_TABLE__COMMON ( 


root_name, rtemp 


#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )
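The BR_VMX_Vn macros differ only in the mask passed to rlwinm. The following C sketch is not part of the listing; it emulates rlwinm to show what ends up in r0. The helper names (rotl32, ppc_mask, rlwinm, vmx_jump_offset) are illustrative assumptions. Read this way, the low n bits of eflag are shifted left by two, giving a byte offset into a table of 4-byte .long entries, which lwzx then adds to the table base.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (sh in 0..31). */
static uint32_t rotl32(uint32_t x, unsigned sh) {
    sh &= 31u;
    return (x << sh) | (x >> ((32u - sh) & 31u));
}

/* Mask with bits mb..me set, using PowerPC numbering (bit 0 is the MSB).
 * When mb > me the mask wraps around, as the instruction defines. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me = 0xFFFFFFFFu << (31u - me);
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Emulation of "rlwinm rA, rS, sh, mb, me": rotate left, then AND with mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* Hypothetical helper: the value BR_VMX_Vn leaves in r0 for n = 1..5,
 * i.e. rlwinm r0, eflag, 2, 30-n, 29. */
static uint32_t vmx_jump_offset(uint32_t eflag, unsigned nvec) {
    return rlwinm(eflag, 2u, 30u - nvec, 29u);
}
```

For example, vmx_jump_offset(eflag, 2) can only take the values 0, 4, 8 and 12, so a two-vector routine would need four jump-table entries under this reading.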





#else /* no ESAL jump table */ 



/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */
#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )

#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )



/*
 * G4 macros to simply branch to root_name (no jump table)
 * (not supported in C)
 */
#define BR_VMX_V1( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    b root_name;



#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )



#endif 



/* end COMPILE_ESAL_JUMP_TABLE */ 






/*
 * G4 macros to decide whether to enter a VMX loop.
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (uses r0 as scratch)
 * (not supported in C)
 */



#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    andi. r0, p1, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:
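The decision made by BR_IF_VMX_V2 can be restated in C. This is only a sketch for reading the macro, not code from the listing: the function names are hypothetical and addresses are passed as plain integers. The xor/andi. pair checks that the two pointers agree in their low 4 bits (same relative alignment); the _ALIGNED variants use or instead of xor, which additionally requires those bits to be zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when both addresses have the same offset within a 16-byte vector
 * (the xor / andi. 0xf test in BR_IF_VMX_V2). */
static bool same_relative_alignment(uintptr_t p1, uintptr_t p2) {
    return ((p1 ^ p2) & 0xFu) == 0;
}

/* True when both addresses are themselves 16-byte aligned
 * (the or / andi. 0xf test in BR_IF_VMX_V2_ALIGNED). */
static bool both_vector_aligned(uintptr_t p1, uintptr_t p2) {
    return ((p1 | p2) & 0xFu) == 0;
}

/* Hypothetical restatement of BR_IF_VMX_V2: enter the VMX loop only if the
 * count reaches the minimum, both strides are unit, and the relative
 * alignment matches. */
static bool enter_vmx_loop_v2(uintptr_t p1, long s1, uintptr_t p2, long s2,
                              unsigned long n, unsigned long min_n,
                              long unit_s) {
    return n >= min_n && s1 == unit_s && s2 == unit_s &&
           same_relative_alignment(p1, p2);
}
```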

#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, r0, ps; \
    bne v_skip_vmx; \
    andi. r0, r0, 0x6; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    andi. r0, pc, 1; \
    bne v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 2; \
    bne v_skip_vmx; \
    xor r0, r0, pc; \
    andi. r0, r0, 0x3; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z1( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z4( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z5( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne v_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, 1; \
    bne z_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne z_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

/*
 * G4 macro to get VMX unaligned word (FP) count
 * assumes that the last 2 bits of ptr are 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 30, 30, 31;

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 31, 29, 31;

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 0, 28, 31;
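Each GET_VMX_UNALIGNED_COUNT macro negates the pointer, then rotates and masks the result. In C terms, that is the number of elements between ptr and the next 16-byte boundary. The sketch below is an illustration only; the function names are assumptions, and the address is modelled as a 32-bit integer to match the 32-bit registers used in the listing.

```c
#include <stdint.h>

/* Words (4 bytes) before the next 16-byte boundary:
 * neg, then rotate right by 2 and keep the low 2 bits. */
static unsigned vmx_unaligned_words(uint32_t ptr) {
    return ((0u - ptr) >> 2) & 3u;
}

/* Shorts (2 bytes) before the next 16-byte boundary:
 * neg, then rotate right by 1 and keep the low 3 bits. */
static unsigned vmx_unaligned_shorts(uint32_t ptr) {
    return ((0u - ptr) >> 1) & 7u;
}

/* Chars (1 byte) before the next 16-byte boundary:
 * neg, then keep the low 4 bits. */
static unsigned vmx_unaligned_chars(uint32_t ptr) {
    return (0u - ptr) & 15u;
}
```

A result of zero means ptr is already vector aligned, which is the case the record form (rlwinm.) exposes through CR0.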

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */
#if defined( LITTLE_ENDIAN )
#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsr vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 3;
#else
#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsl vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 0;
#endif

/*
 * G4 macro to construct an FP absolute value mask that can be used with
 * vand to take the absolute value of 4 FP numbers in a vector register
 * vt = 0x7fffffff7fffffff7fffffff7fffffff
 */
#define MAKE_VABS_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt; \
    vnor vt, vt, vt;

/*
 * G4 macro to construct an FP sign mask that can be used with:
 *     vandc to take the absolute value of
 *     vor to take the negative absolute value of
 *     vxor to negate
 * 4 FP numbers in a vector register
 * vt = 0x80000000800000008000000080000000
 */
#define MAKE_VSIGN_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt;
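The two mask macros build their constants without a memory load. A scalar C sketch of one 32-bit lane may help to see why the sequence works; it is an illustration with hypothetical function names, not part of the listing. Splatting -1 gives 0xffffffff, shifting each word left by its own low 5 bits (31) leaves only the sign bit, and the final nor inverts that into the absolute-value mask.

```c
#include <stdint.h>
#include <string.h>

/* One lane of MAKE_VSIGN_MASK: vspltisw -1, then vslw by the lane's own
 * low 5 bits (31), which leaves 0x80000000. */
static uint32_t vsign_mask_lane(void) {
    uint32_t w = 0xFFFFFFFFu;      /* vspltisw vt, -1 */
    return w << (w & 31u);         /* vslw vt, vt, vt */
}

/* One lane of MAKE_VABS_MASK: the sign mask followed by vnor vt, vt, vt. */
static uint32_t vabs_mask_lane(void) {
    return ~vsign_mask_lane();
}

/* Absolute value of one float by clearing its sign bit, as vand with the
 * absolute-value mask does for four lanes at once. */
static float fabs_via_mask(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= vabs_mask_lane();
    memcpy(&x, &bits, sizeof x);
    return x;
}
```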


/*
 * G4 macros to construct a coded touch stream control register
 * "I" indicates argument is passed as an immediate value
 * "R" indicates argument is passed in an integer register
 * bytes_per_block = # of bytes in each block
 *     (0 = 512, 16, 32, 480, 512)
 * block_count = # of blocks (0 = 256, 1, 2, 3, ... 256)
 * byte_stride = signed byte stride between start of adjacent blocks
 *     (-32768 <= byte_stride < 0; 0 = 32768; 0 < byte_stride < 32768)
 */
#define MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE( rB, bytes_per_block, block_count, byte_stride ) \
    MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride )

#define MAKE_STREAM_CODE_IIR( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_IRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_IRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RII( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RIR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    rlwimi rB, byte_stride, 0, 16, 31;
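All eight MAKE_STREAM_CODE variants assemble the same 32-bit control word and differ only in which arguments arrive as immediates or registers. As a reading aid, the packing done by the all-immediate form (lis followed by ori) can be sketched in C; the function name is an assumption and the code is not part of the listing.

```c
#include <stdint.h>

/* Pack a data-stream control word the way MAKE_STREAM_CODE_III does:
 *   bits 24..28 : bytes_per_block / 16, modulo 32 (so 512 encodes as 0)
 *   bits 16..23 : block_count modulo 256         (so 256 encodes as 0)
 *   bits  0..15 : byte_stride as a 16-bit two's-complement value
 * Bit positions here count from the least significant bit. */
static uint32_t make_stream_code(uint32_t bytes_per_block,
                                 uint32_t block_count,
                                 int32_t byte_stride) {
    uint32_t high = (((bytes_per_block >> 4) & 31u) << 8) | (block_count & 255u);
    return (high << 16) | ((uint32_t)byte_stride & 0xFFFFu);
}
```

For instance, make_stream_code(32, 4, 64) yields 0x02040040: two 16-byte units per block, four blocks, and a stride of 64 bytes.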

#endif /* end BUILD_MAX */ 

#define CACHE_TB_THRESHOLD 1        /* 2 TB ticks = 12 CPU 100 MHz clks */

#define INSTRUCTION_CACHE_COUNT 3   /* min. to fully cache instructions */

#define POSTING_BUFFER_COUNT 10     /* min. to fill posting buffer */

/*
 * macros to set DCBx conditions explicitly
 */
#define DCBT_TRUE( cond_bit, scratch ) \
    li scratch, 0; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    li scratch, 2; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )
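These macros keep a boolean in a condition-register field chosen by cond_bit. DCBT_TRUE compares 0 with 1, so the field records "less than", while DCBT_FALSE compares 2 with 1, so it records "greater than". The conditional macros elsewhere in this listing test the "greater than" bit, index (cond_bit<<2)+1, and treat the condition as true when that bit is clear. The C sketch below models this convention; the enum and function names are illustrative assumptions, not identifiers from the listing.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 4-bit CR field written as a value: LT is the most significant bit. */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Result of an unsigned compare such as "cmplwi crf, a, imm"
 * (the summary-overflow bit is ignored in this sketch). */
static unsigned cmplwi_field(uint32_t a, uint32_t imm) {
    if (a < imm) return CR_LT;
    if (a > imm) return CR_GT;
    return CR_EQ;
}

/* Index of a bit inside the whole condition register:
 * which = 0 (LT), 1 (GT), 2 (EQ) or 3 (SO). */
static unsigned cr_bit_index(unsigned cond_bit, unsigned which) {
    return (cond_bit << 2) + which;
}

/* The condition counts as true ("<=") when the GT bit is clear. */
static bool cond_is_true(unsigned field) {
    return (field & CR_GT) == 0;
}
```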



/*
 * This macro will cause a file not to assemble.
 */
#define DO_NOT_ASSEMBLE add scratch1, scratch2, 256;

/*
 * Obsolete macro will cause assembler error
 */
#define TEST_IF_CACHABLE( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * Obsolete macro will cause assembler error
 */
#define TEST_IF_CACHABLE_ALIGN( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * macros to test if a DCBT or DCBZ instruction should be performed on
 * a particular buffer based on a bit test (cache_bit) on a specified
 * ESAL flag.
 */
#define TEST_IF_DCBT( cond_bit, cache_bit, eflag, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    andi. scratch1, eflag, (cache_bit); \
    cmplwi (cond_bit), scratch1, 0;

/*
 * Set 2 dcbt conditions and ensure only one is true
 *
 * Ins. 1-3  Set both conditions to "No DCBT"
 * Ins. 4    See if vec1 has a C
 * Ins. 5    Set DCBT cond1
 * Ins. 6    Branch if "DCBT TRUE" (eflag & bit1 = 0)
 * Ins. 7-8  Set DCBT cond2
 */
#define SET_2_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0; \
    bc 12, ((cond1_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0;

/*
 * Set 3 dcbt conditions and ensure only one is true
 *
 * Logic is similar to the SET_2_DCBT_COND() macro
 */
#define SET_3_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

/*
 * Set 4 dcbt conditions and ensure only one is true
 *
 * Logic is similar to the SET_2_DCBT_COND() macro
 */
#define SET_4_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, cond4_bit, cache_bit4, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    cmplwi (cond4_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit4); \
    cmplwi (cond4_bit), scratch, 0; \
    bc 12, ((cond4_bit)<<2)+2, PC_OFFSET( 36 ); \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

#if !defined COMPILE_NO_DCBZ

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 104 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 92 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 84 ); \
    addi tmp2, buffer, CACHE_LINE_SIZE; \
    li tmp3, CACHE_LINE_ADDR_MASK; \
    and tmp2, tmp2, tmp3; \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, tmp2; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 100 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 88 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 80 ); \
    andi. tmp3, buffer, CACHE_LINE_MASK; \
    bne PC_OFFSET( 72 ); \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, buffer; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#else /* COMPILE_NO_DCBZ is defined */

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#endif /* COMPILE_NO_DCBZ */ 
/*
 * macro to perform [or skip] a dcbt instruction based on the result
 * of a prior call to TEST_IF_DCBT (specifying the same condition bit).
 * dcbt is performed if the cond "<=" is true; otherwise dcbt is skipped.
 */
#define DCBT_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbt rA, rB;

/*
 * macro to perform [or skip] a dcbz instruction based on the result
 * of a prior call to TEST_IF_DCBZ (specifying the same condition bit).
 * dcbz is performed if the cond "<=" is true; otherwise dcbz is skipped.
 */
#if !defined COMPILE_NO_DCBZ
#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbz rA, rB;
#else
#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    nop;
#endif



/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was cachable (i.e. TB read time was <= CACHE_TB_THRESHOLD).
 */
#define BR_IF_COND_TRUE( cond_bit, label ) \
    bc 4, ((cond_bit)<<2)+1, label; /* <= */

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was NOT cachable (i.e. TB read time was > CACHE_TB_THRESHOLD).
 */
#define BR_IF_COND_FALSE( cond_bit, label ) \
    bc 12, ((cond_bit)<<2)+1, label; /* > */

/* 

* ASIC macros 
*/ 

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, PREFETCH_CONTROL_H; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, MISCON_B_H; \
    stw scratch1, MISCON_B_L( scratch2 );

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
    addis scratch2, 0, ASIC_H; \
    lwz scratch1, MISCON_B_L( scratch2 ); \
    andi. scratch1, scratch1, PREFETCH_MASK; \
    ori scratch1, scratch1, USE_PREFETCH_CONTROL; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif






/* 

* instruction macros 
*/ 

#define ADD( rD, rA, rB )               add rD, rA, rB;
#define ADD_C( rD, rA, rB )             add. rD, rA, rB;
#define ADDI( rD, rA, SIMM )            addi rD, rA, (SIMM);
#define ADDIC_C( rD, rA, SIMM )         addic. rD, rA, (SIMM);
#define ADDIS( rD, rA, SIMM )           addis rD, rA, (SIMM);
#define AND( rA, rS, rB )               and rA, rS, rB;
#define AND_C( rA, rS, rB )             and. rA, rS, rB;
#define ANDC( rA, rS, rB )              andc rA, rS, rB;
#define ANDC_C( rA, rS, rB )            andc. rA, rS, rB;
#define ANDI_C( rA, rS, UIMM )          andi. rA, rS, (UIMM);
#define ANDIS_C( rA, rS, UIMM )         andis. rA, rS, (UIMM);
#define BA( label )                     ba label;
#define BCTR                            bctr;
#define BCTRL                           bctrl;
#define BEQ( label )                    beq label;
#define BEQ_PLUS( label )               beq+ label;
#define BEQ_MINUS( label )              beq- label;
#define BEQ_CR( bit, label )            beq (bit), label;
#define BEQ_CR_PLUS( bit, label )       beq+ (bit), label;
#define BEQ_CR_MINUS( bit, label )      beq- (bit), label;
#define BEQLR                           beqlr;
#define BEQLR_PLUS                      beqlr+;
#define BEQLR_MINUS                     beqlr-;
#define BEQLR_CR( bit )                 beqlr (bit);
#define BEQLR_CR_PLUS( bit )            beqlr+ (bit);
#define BEQLR_CR_MINUS( bit )           beqlr- (bit);
#define BGE( label )                    bge label;
#define BGE_PLUS( label )               bge+ label;
#define BGE_MINUS( label )              bge- label;
#define BGE_CR( bit, label )            bge (bit), label;
#define BGE_CR_PLUS( bit, label )       bge+ (bit), label;
#define BGE_CR_MINUS( bit, label )      bge- (bit), label;
#define BGELR                           bgelr;
#define BGELR_PLUS                      bgelr+;
#define BGELR_MINUS                     bgelr-;
#define BGELR_CR( bit )                 bgelr (bit);
#define BGELR_CR_PLUS( bit )            bgelr+ (bit);
#define BGELR_CR_MINUS( bit )           bgelr- (bit);
#define BGT( label )                    bgt label;
#define BGT_PLUS( label )               bgt+ label;
#define BGT_MINUS( label )              bgt- label;
#define BGT_CR( bit, label )            bgt (bit), label;
#define BGT_CR_PLUS( bit, label )       bgt+ (bit), label;
#define BGT_CR_MINUS( bit, label )      bgt- (bit), label;
#define BGTLR                           bgtlr;
#define BGTLR_PLUS                      bgtlr+;
#define BGTLR_MINUS                     bgtlr-;
#define BGTLR_CR( bit )                 bgtlr (bit);
#define BGTLR_CR_PLUS( bit )            bgtlr+ (bit);
#define BGTLR_CR_MINUS( bit )           bgtlr- (bit);
#define BL( label )                     bl label;
#define BLE( label )                    ble label;
#define BLE_PLUS( label )               ble+ label;
#define BLE_MINUS( label )              ble- label;
#define BLE_CR( bit, label )            ble (bit), label;
#define BLE_CR_PLUS( bit, label )       ble+ (bit), label;
#define BLE_CR_MINUS( bit, label )      ble- (bit), label;
#define BLELR                           blelr;
#define BLELR_PLUS                      blelr+;
#define BLELR_MINUS                     blelr-;
#define BLELR_CR( bit )                 blelr (bit);
#define BLELR_CR_PLUS( bit )            blelr+ (bit);
#define BLELR_CR_MINUS( bit )           blelr- (bit);
#define BLR                             blr;
#define BLRL                            blrl;
#define BLT( label )                    blt label;
#define BLT_PLUS( label )               blt+ label;
#define BLT_MINUS( label )              blt- label;
#define BLT_CR( bit, label )            blt (bit), label;
#define BLT_CR_PLUS( bit, label )       blt+ (bit), label;
#define BLT_CR_MINUS( bit, label )      blt- (bit), label;
#define BLTLR                           bltlr;
#define BLTLR_PLUS                      bltlr+;
#define BLTLR_MINUS                     bltlr-;
#define BLTLR_CR( bit )                 bltlr (bit);
#define BLTLR_CR_PLUS( bit )            bltlr+ (bit);
#define BLTLR_CR_MINUS( bit )           bltlr- (bit);
#define BNE( label )                    bne label;
#define BNE_PLUS( label )               bne+ label;
#define BNE_MINUS( label )              bne- label;
#define BNE_CR( bit, label )            bne (bit), label;
#define BNE_CR_PLUS( bit, label )       bne+ (bit), label;
#define BNE_CR_MINUS( bit, label )      bne- (bit), label;
#define BNELR                           bnelr;
#define BNELR_PLUS                      bnelr+;
#define BNELR_MINUS                     bnelr-;
#define BNELR_CR( bit )                 bnelr (bit);
#define BNELR_CR_PLUS( bit )            bnelr+ (bit);
#define BNELR_CR_MINUS( bit )           bnelr- (bit);
#define BR( label )                     b label;
#define CLRLWI( rA, rS, nbits )         clrlwi rA, rS, (nbits);
#define CLRLWI_C( rA, rS, nbits )       clrlwi. rA, rS, (nbits);
#define CLRRWI( rA, rS, nbits )         clrrwi rA, rS, (nbits);
#define CLRRWI_C( rA, rS, nbits )       clrrwi. rA, rS, (nbits);
#define CMPLW( rA, rB )                 cmplw rA, rB;
#define CMPLW_CR( bit, rA, rB )         cmplw bit, rA, rB;
#define CMPLWI( rA, UIMM )              cmplwi rA, (UIMM);
#define CMPLWI_CR( bit, rA, UIMM )      cmplwi bit, rA, (UIMM);
#define CMPW( rA, rB )                  cmpw rA, rB;
#define CMPW_CR( bit, rA, rB )          cmpw bit, rA, rB;
#define CMPWI( rA, SIMM )               cmpwi rA, (SIMM);
#define CMPWI_CR( bit, rA, SIMM )       cmpwi bit, rA, (SIMM);
#define DCBF( rA, rB )                  dcbf rA, rB;
#define DCBI( rA, rB )                  dcbi rA, rB;
#define DCBST( rA, rB )                 dcbst rA, rB;
#define DCBT( rA, rB )                  dcbt rA, rB;
#define DCBTST( rA, rB )                dcbtst rA, rB;
#if !defined COMPILE_NO_DCBZ
#define DCBZ( rA, rB )                  dcbz rA, rB;
#else
#define DCBZ( rA, rB )                  nop;
#endif
#define DECR( rD )                      addi rD, rD, -1;
#define DECR_C( rD )                    addic. rD, rD, -1;
#define DIVW( rD, rA, rB )              divw rD, rA, rB;
#define DIVW_C( rD, rA, rB )            divw. rD, rA, rB;
#define DIVWU( rD, rA, rB )             divwu rD, rA, rB;
#define DIVWU_C( rD, rA, rB )           divwu. rD, rA, rB;
#define EQV( rA, rS, rB )               eqv rA, rS, rB;
#define EQV_C( rA, rS, rB )             eqv. rA, rS, rB;
#define EXTLWI( rA, rS, n, b )          rlwinm rA, rS, (b), 0, (n)-1;
#define EXTLWI_C( rA, rS, n, b )        rlwinm. rA, rS, (b), 0, (n)-1;
#define EXTRWI( rA, rS, n, b )          rlwinm rA, rS, (b)+(n), 32-(n), 31;
#define EXTRWI_C( rA, rS, n, b )        rlwinm. rA, rS, (b)+(n), 32-(n), 31;
#define FABS( frD, frB )                fabs frD, frB;
#define FADD( frD, frA, frB )           fadd frD, frA, frB;
#define FADDS( frD, frA, frB )          fadds frD, frA, frB;
#define FCMPO( bit, frA, frB )          fcmpo bit, frA, frB;
#define FCMPU( bit, frA, frB )          fcmpu bit, frA, frB;
#define FCTIW( frD, frB )               fctiw frD, frB;
#define FCTIWZ( frD, frB )              fctiwz frD, frB;
#define FDIV( frD, frA, frB )           fdiv frD, frA, frB;
#define FDIVS( frD, frA, frB )          fdivs frD, frA, frB;
#define FMADD( frD, frA, frC, frB )     fmadd frD, frA, frC, frB;
#define FMADDS( frD, frA, frC, frB )    fmadds frD, frA, frC, frB;
#define FMOV( frD, frB )                FMR( frD, frB )
#define FMR( frD, frB )                 fmr frD, frB;
#define FMUL( frD, frA, frB )           fmul frD, frA, frB;
#define FMULS( frD, frA, frB )          fmuls frD, frA, frB;
#define FMSUB( frD, frA, frC, frB )     fmsub frD, frA, frC, frB;
#define FMSUBS( frD, frA, frC, frB )    fmsubs frD, frA, frC, frB;
#define FNABS( frD, frB )               fnabs frD, frB;
#define FNEG( frD, frB )                fneg frD, frB;
#define FNMADD( frD, frA, frC, frB )    fnmadd frD, frA, frC, frB;
#define FNMADDS( frD, frA, frC, frB )   fnmadds frD, frA, frC, frB;
#define FNMSUB( frD, frA, frC, frB )    fnmsub frD, frA, frC, frB;
#define FNMSUBS( frD, frA, frC, frB )   fnmsubs frD, frA, frC, frB;
#define FRES( frD, frB )                fres frD, frB;
#define FRSP( frD, frB )                frsp frD, frB;
#define FRSQRTE( frD, frB )             frsqrte frD, frB;
#define FSEL( frD, frA, frC, frB )      fsel frD, frA, frC, frB;
#define FSUB( frD, frA, frB )           fsub frD, frA, frB;
#define FSUBS( frD, frA, frB )          fsubs frD, frA, frB;
#define GOTO( label )                   BR( label )
#define INCR( rD )                      addi rD, rD, 1;
#define INCR_C( rD )                    addic. rD, rD, 1;
#define INSLWI( rA, rS, n, b )          rlwimi rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSLWI_C( rA, rS, n, b )        rlwimi. rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSRWI( rA, rS, n, b )          rlwimi rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define INSRWI_C( rA, rS, n, b )        rlwimi. rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define LA( rD, symbol, SIMM ) \
    addis rD, 0, (symbol+(SIMM))@ha; \
    addi rD, rD, (symbol+(SIMM))@l;
#define LABEL( label )                  label:
#define LBZ( rD, rA, d )                lbz rD, (d)(rA);
#define LBZA( rD, symbol ) \
    addis rD, 0, (symbol)@ha; \
    lbz rD, (symbol)@l(rD);
#define LBZU( rD, rA, d )               lbzu rD, (d)(rA);
#define LBZUX( rD, rA, rB )             lbzux rD, rA, rB;
#define LBZX( rD, rA, rB )              lbzx rD, rA, rB;
#define LFD( frD, rA, d )               lfd frD, (d)(rA);
#define LFDU( frD, rA, d )              lfdu frD, (d)(rA);
#define LFDUX( frD, rA, rB )            lfdux frD, rA, rB;
#define LFDX( frD, rA, rB )             lfdx frD, rA, rB;
#define LFS( frD, rA, d )               lfs frD, (d)(rA);
#define LFSA( frD, symbol, rT )
#define LFSU( frD, rA, d )
#define LFSUX( frD, rA, rB )
#define LFSX( frD, rA, rB )

addis rT, 0, ( symbol )@ha; \ 

Ifs frD, (symbol) @1 (rT) ; 

If su frD, (d) (rA) ; 

Ifsiix frD, rA, rB; 

Ifsx frD, rA, rB; 
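The EXTLWI and EXTRWI macros in this listing are thin wrappers around the rotate-and-mask instruction `rlwinm`. As a rough illustration only, the C sketch below models `rlwinm` on a 32-bit word and shows what the two expansions compute. The helper names (`rotl32`, `ppc_mask`, `rlwinm`, `extlwi`, `extrwi`) are hypothetical, and the model assumes PowerPC bit numbering (bit 0 is the most significant bit) and a non-wrapping mask.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (sh taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Mask with ones in bits mb..me, PowerPC numbering (bit 0 = MSB).
 * Only the non-wrapping case mb <= me is modeled in this sketch. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb <= me && me <= 31u);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);  /* bits 0..me  */
    return from_mb & to_me;
}

/* Model of rlwinm rA, rS, SH, MB, ME: rotate left, then AND with mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* EXTLWI: extract n bits starting at bit b, left-justified in the result. */
static uint32_t extlwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b, 0, n - 1);
}

/* EXTRWI: extract n bits starting at bit b, right-justified in the result. */
static uint32_t extrwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b + n, 32 - n, 31);
}
```

With the word 0x12345678, the 8-bit field starting at bit 8 is 0x34: `extrwi` returns it in the low byte, `extlwi` in the high byte.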



#define LHA( rD, rA, d )                lha rD, (d)(rA);
#define LHAA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lha rD, (symbol)@l(rD);
#define LHAU( rD, rA, d )               lhau rD, (d)(rA);
#define LHAUX( rD, rA, rB )             lhaux rD, rA, rB;
#define LHAX( rD, rA, rB )              lhax rD, rA, rB;
#define LHZ( rD, rA, d )                lhz rD, (d)(rA);
#define LHZA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lhz rD, (symbol)@l(rD);
#define LHZU( rD, rA, d )               lhzu rD, (d)(rA);
#define LHZUX( rD, rA, rB )             lhzux rD, rA, rB;
#define LHZX( rD, rA, rB )              lhzx rD, rA, rB;
#define LI( rD, SIMM )                  li rD, (SIMM);
#define LIS( rD, SIMM )                 lis rD, (SIMM);
#define LOAD_COUNT( rD )                mtctr rD;
#define LWZ( rD, rA, d )                lwz rD, (d)(rA);
#define LWZA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lwz rD, (symbol)@l(rD);
#define LWZU( rD, rA, d )               lwzu rD, (d)(rA);
#define LWZUX( rD, rA, rB )             lwzux rD, rA, rB;
#define LWZX( rD, rA, rB )              lwzx rD, rA, rB;
#define MCRF( crfD, crfS )              mcrf crfD, crfS;
#define MCRFS( crfD, crfS )             mcrfs crfD, crfS;
#define MFCR( rD )                      mfcr rD;
#define MFCTR( rD )                     mfctr rD;
#define MFLR( rD )                      mflr rD;
#define MFSPR( rD, SPR )                mfspr rD, SPR;
#define MR( rA, rS )                    mr rA, rS;
#define MR_C( rA, rS )                  or. rA, rS, rS;
#define MOV( rA, rS )                   MR( rA, rS )
#define MOV_C( rA, rS )                 MR_C( rA, rS )
#define MTCR( rD )                      mtcr rD;
#define MTCTR( rD )                     mtctr rD;
#define MTFSFI( crfD, IMM )             mtfsfi (crfD), (IMM);
#define MTLR( rD )                      mtlr rD;
#define MTSPR( SPR, rS )                mtspr SPR, rS;
#define MULLI( rD, rA, SIMM )           mulli rD, rA, (SIMM);
#define MULLW( rD, rA, rB )             mullw rD, rA, rB;
#define MULLW_C( rD, rA, rB )           mullw. rD, rA, rB;
#define NAND( rA, rS, rB )              nand rA, rS, rB;
#define NAND_C( rA, rS, rB )            nand. rA, rS, rB;
#define NEG( rD, rA )                   neg rD, rA;
#define NEG_C( rD, rA )                 neg. rD, rA;
#define NOP                             nop;
#define NOR( rA, rS, rB )               nor rA, rS, rB;
#define NOR_C( rA, rS, rB )             nor. rA, rS, rB;
#define OR( rA, rS, rB )                or rA, rS, rB;
#define OR_C( rA, rS, rB )              or. rA, rS, rB;
#define ORC( rA, rS, rB )               orc rA, rS, rB;
#define ORC_C( rA, rS, rB )             orc. rA, rS, rB;
#define ORI( rA, rS, UIMM )             ori rA, rS, (UIMM);
#define ORIS( rA, rS, UIMM )            oris rA, rS, (UIMM);
#define RETURN                          BLR
#define RLWIMI( rA, rS, SH, MB, ME )    rlwimi rA, rS, SH, MB, ME;
#define RLWIMI_C( rA, rS, SH, MB, ME )  rlwimi. rA, rS, SH, MB, ME;
#define RLWINM( rA, rS, SH, MB, ME )    rlwinm rA, rS, SH, MB, ME;
#define RLWINM_C( rA, rS, SH, MB, ME )  rlwinm. rA, rS, SH, MB, ME;
#define RLWNM( rA, rS, rB, MB, ME )     rlwnm rA, rS, rB, MB, ME;
#define RLWNM_C( rA, rS, rB, MB, ME )   rlwnm. rA, rS, rB, MB, ME;
#define ROTLW( rA, rS, rB )             rlwnm rA, rS, rB, 0, 31;
#define ROTLW_C( rA, rS, rB )           rlwnm. rA, rS, rB, 0, 31;
#define ROTLWI( rA, rS, n )             rlwinm rA, rS, (n), 0, 31;
#define ROTLWI_C( rA, rS, n )           rlwinm. rA, rS, (n), 0, 31;
#define ROTRWI( rA, rS, n )             rlwinm rA, rS, 32-(n), 0, 31;
#define ROTRWI_C( rA, rS, n )           rlwinm. rA, rS, 32-(n), 0, 31;
#define SLW( rA, rS, rB )               slw rA, rS, rB;
#define SLW_C( rA, rS, rB )             slw. rA, rS, rB;
#define SLWI( rA, rS, SH )              slwi rA, rS, (SH);
#define SLWI_C( rA, rS, SH )            slwi. rA, rS, (SH);
#define SRAW( rA, rS, rB )              sraw rA, rS, rB;
#define SRAW_C( rA, rS, rB )            sraw. rA, rS, rB;
#define SRAWI( rA, rS, SH )             srawi rA, rS, (SH);
#define SRAWI_C( rA, rS, SH )           srawi. rA, rS, (SH);
#define SRW( rA, rS, rB )               srw rA, rS, rB;
#define SRW_C( rA, rS, rB )             srw. rA, rS, rB;
#define SRWI( rA, rS, SH )              srwi rA, rS, (SH);
#define SRWI_C( rA, rS, SH )            srwi. rA, rS, (SH);
#define STB( rS, rA, d )                stb rS, (d)(rA);
#define STBU( rS, rA, d )               stbu rS, (d)(rA);
#define STBUX( rS, rA, rB )             stbux rS, rA, rB;
#define STBX( rS, rA, rB )              stbx rS, rA, rB;
#define STFD( frD, rA, d )              stfd frD, (d)(rA);
#define STFDU( frD, rA, d )             stfdu frD, (d)(rA);
#define STFDUX( frD, rA, rB )           stfdux frD, rA, rB;
#define STFDX( frD, rA, rB )            stfdx frD, rA, rB;
#define STFS( frD, rA, d )              stfs frD, (d)(rA);
#define STFSU( frD, rA, d )             stfsu frD, (d)(rA);
#define STFSUX( frD, rA, rB )           stfsux frD, rA, rB;
#define STFSX( frD, rA, rB )            stfsx frD, rA, rB;
#define STH( rS, rA, d )                sth rS, (d)(rA);
#define STHU( rS, rA, d )               sthu rS, (d)(rA);
#define STHUX( rS, rA, rB )             sthux rS, rA, rB;
#define STHX( rS, rA, rB )              sthx rS, rA, rB;
#define STW( rS, rA, d )                stw rS, (d)(rA);
#define STWU( rS, rA, d )               stwu rS, (d)(rA);
#define STWUX( rS, rA, rB )             stwux rS, rA, rB;
#define STWX( rS, rA, rB )              stwx rS, rA, rB;
#define SUB( rD, rA, rB )               sub rD, rA, rB;
#define SUB_C( rD, rA, rB )             sub. rD, rA, rB;
#define SUBFIC( rD, rA, SIMM )          subfic rD, rA, (SIMM);
#define SUBI( rD, rA, SIMM )            subi rD, rA, (SIMM);
#define SUBIC_C( rD, rA, SIMM )         subic. rD, rA, (SIMM);
#define SUBIS( rD, rA, SIMM )           subis rD, rA, (SIMM);
#define TEST_COUNT( label )             bdnz label;
#define XOR( rA, rS, rB )               xor rA, rS, rB;
#define XOR_C( rA, rS, rB )             xor. rA, rS, rB;
#define XORI( rA, rS, UIMM )            xori rA, rS, (UIMM);
#define XORIS( rA, rS, UIMM )           xoris rA, rS, (UIMM);

/*
 * VMX instructions
 */

#define BR_VMX_ALL_TRUE( label )        bt 24, label;
#define BR_VMX_ALL_FALSE( label )       bt 26, label;
#define BR_VMX_NONE_TRUE( label )       bt 26, label;
#define BR_VMX_SOME_FALSE( label )      bf 24, label;
#define BR_VMX_SOME_TRUE( label )       bf 26, label;
#define DSS( STRM )                     dss STRM, 0;
#define DSSALL                          dss 0, 1;
#define DST( rA, rB, STRM )             dst rA, rB, STRM;
#define DSTST( rA, rB, STRM )           dstst rA, rB, STRM;
#define DSTT( rA, rB, STRM )            dstt rA, rB, STRM;
#define DSTSTT( rA, rB, STRM )          dststt rA, rB, STRM;
#define LVEBX( vT, rA, rB )             lvebx vT, rA, rB;
#define LVEHX( vT, rA, rB )             lvehx vT, rA, rB;
#define LVEWX( vT, rA, rB )             lvewx vT, rA, rB;
#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )              lvsr vT, rA, rB;
#define LVSR( vT, rA, rB )              lvsl vT, rA, rB;
#else
#define LVSL( vT, rA, rB )              lvsl vT, rA, rB;
#define LVSR( vT, rA, rB )              lvsr vT, rA, rB;
#endif
#define LVX( vT, rA, rB )               lvx vT, rA, rB;
#define LVXL( vT, rA, rB )              lvxl vT, rA, rB;
#define STVEBX( vS, rA, rB )            stvebx vS, rA, rB;
#define STVEHX( vS, rA, rB )            stvehx vS, rA, rB;
#define STVEWX( vS, rA, rB )            stvewx vS, rA, rB;
#define STVX( vS, rA, rB )              stvx vS, rA, rB;
#define STVXL( vS, rA, rB )             stvxl vS, rA, rB;
#define VADDFP( vT, vA, vB )            vaddfp vT, vA, vB;
#define VADDSBS( vT, vA, vB )           vaddsbs vT, vA, vB;
#define VADDSHS( vT, vA, vB )           vaddshs vT, vA, vB;
#define VADDSWS( vT, vA, vB )           vaddsws vT, vA, vB;
#define VADDUBM( vT, vA, vB )           vaddubm vT, vA, vB;
#define VADDUBS( vT, vA, vB )           vaddubs vT, vA, vB;
#define VADDUHM( vT, vA, vB )           vadduhm vT, vA, vB;
#define VADDUHS( vT, vA, vB )           vadduhs vT, vA, vB;
#define VADDUWM( vT, vA, vB )           vadduwm vT, vA, vB;
#define VADDUWS( vT, vA, vB )           vadduws vT, vA, vB;
#define VAND( vT, vA, vB )              vand vT, vA, vB;
#define VANDC( vT, vA, vB )             vandc vT, vA, vB;
#define VCMPEQFP( vT, vA, vB )          vcmpeqfp vT, vA, vB;
#define VCMPEQFP_C( vT, vA, vB )        vcmpeqfp. vT, vA, vB;
#define VCMPEQUB( vT, vA, vB )          vcmpequb vT, vA, vB;
#define VCMPEQUB_C( vT, vA, vB )        vcmpequb. vT, vA, vB;
#define VCMPEQUH( vT, vA, vB )          vcmpequh vT, vA, vB;
#define VCMPEQUH_C( vT, vA, vB )        vcmpequh. vT, vA, vB;
#define VCMPEQUW( vT, vA, vB )          vcmpequw vT, vA, vB;
#define VCMPEQUW_C( vT, vA, vB )        vcmpequw. vT, vA, vB;
#define VCMPGEFP( vT, vA, vB )          vcmpgefp vT, vA, vB;
#define VCMPGEFP_C( vT, vA, vB )        vcmpgefp. vT, vA, vB;
#define VCMPGTFP( vT, vA, vB )          vcmpgtfp vT, vA, vB;
#define VCMPGTFP_C( vT, vA, vB )        vcmpgtfp. vT, vA, vB;
#define VCMPGTSB( vT, vA, vB )          vcmpgtsb vT, vA, vB;
#define VCMPGTSB_C( vT, vA, vB )        vcmpgtsb. vT, vA, vB;
#define VCMPGTSH( vT, vA, vB )          vcmpgtsh vT, vA, vB;
#define VCMPGTSH_C( vT, vA, vB )        vcmpgtsh. vT, vA, vB;
#define VCMPGTSW( vT, vA, vB )          vcmpgtsw vT, vA, vB;
#define VCMPGTSW_C( vT, vA, vB )        vcmpgtsw. vT, vA, vB;
#define VCMPGTUB( vT, vA, vB )          vcmpgtub vT, vA, vB;
#define VCMPGTUB_C( vT, vA, vB )        vcmpgtub. vT, vA, vB;
#define VCMPGTUH( vT, vA, vB )          vcmpgtuh vT, vA, vB;
#define VCMPGTUH_C( vT, vA, vB )        vcmpgtuh. vT, vA, vB;
#define VCMPGTUW( vT, vA, vB )          vcmpgtuw vT, vA, vB;
#define VCMPGTUW_C( vT, vA, vB )        vcmpgtuw. vT, vA, vB;
#define VCFSX( vT, vB, UIMM )           vcfsx vT, vB, (UIMM);
#define VCFUX( vT, vB, UIMM )           vcfux vT, vB, (UIMM);
#define VCTSXS( vT, vB, UIMM )          vctsxs vT, vB, (UIMM);
#define VCTUXS( vT, vB, UIMM )          vctuxs vT, vB, (UIMM);
#define VEXPTEFP( vT, vB )              vexptefp vT, vB;
#define VLOGEFP( vT, vB )               vlogefp vT, vB;
#define VMADDFP( vT, vA, vC, vB )       vmaddfp vT, vA, vC, vB;
#define VMAXFP( vT, vA, vB )            vmaxfp vT, vA, vB;
#define VMAXSB( vT, vA, vB )            vmaxsb vT, vA, vB;
#define VMAXSH( vT, vA, vB )            vmaxsh vT, vA, vB;
#define VMAXSW( vT, vA, vB )            vmaxsw vT, vA, vB;
#define VMAXUB( vT, vA, vB )            vmaxub vT, vA, vB;
#define VMAXUH( vT, vA, vB )            vmaxuh vT, vA, vB;
#define VMAXUW( vT, vA, vB )            vmaxuw vT, vA, vB;
#define VMHADDSHS( vD, vA, vB, vC )     vmhaddshs vD, vA, vB, vC;
#define VMHRADDSHS( vD, vA, vB, vC )    vmhraddshs vD, vA, vB, vC;
#define VMINFP( vT, vA, vB )            vminfp vT, vA, vB;
#define VMINSB( vT, vA, vB )            vminsb vT, vA, vB;
#define VMINSH( vT, vA, vB )            vminsh vT, vA, vB;
#define VMINSW( vT, vA, vB )            vminsw vT, vA, vB;
#define VMINUB( vT, vA, vB )            vminub vT, vA, vB;
#define VMINUH( vT, vA, vB )            vminuh vT, vA, vB;
#define VMINUW( vT, vA, vB )            vminuw vT, vA, vB;
#define VMLADDUHM( vD, vA, vB, vC )     vmladduhm vD, vA, vB, vC;
#define VMR( vD, vS )                   vor vD, vS, vS;
#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )            vmrglb vT, vB, vA;
#define VMRGHH( vT, vA, vB )            vmrglh vT, vB, vA;
#define VMRGHW( vT, vA, vB )            vmrglw vT, vB, vA;
#define VMRGLB( vT, vA, vB )            vmrghb vT, vB, vA;
#define VMRGLH( vT, vA, vB )            vmrghh vT, vB, vA;
#define VMRGLW( vT, vA, vB )            vmrghw vT, vB, vA;
#else
#define VMRGHB( vT, vA, vB )            vmrghb vT, vA, vB;
#define VMRGHH( vT, vA, vB )            vmrghh vT, vA, vB;
#define VMRGHW( vT, vA, vB )            vmrghw vT, vA, vB;
#define VMRGLB( vT, vA, vB )            vmrglb vT, vA, vB;
#define VMRGLH( vT, vA, vB )            vmrglh vT, vA, vB;
#define VMRGLW( vT, vA, vB )            vmrglw vT, vA, vB;
#endif
#define VMSUMMBM( vT, vA, vB, vC )      vmsummbm vT, vA, vB, vC;
#define VMSUMSHM( vT, vA, vB, vC )      vmsumshm vT, vA, vB, vC;
#define VMSUMSHS( vT, vA, vB, vC )      vmsumshs vT, vA, vB, vC;
#define VMSUMUBM( vT, vA, vB, vC )      vmsumubm vT, vA, vB, vC;
#define VMSUMUHM( vT, vA, vB, vC )      vmsumuhm vT, vA, vB, vC;
#define VMSUMUHS( vT, vA, vB, vC )      vmsumuhs vT, vA, vB, vC;
#define VMULESB( vT, vA, vB )           vmulesb vT, vA, vB;
#define VMULESH( vT, vA, vB )           vmulesh vT, vA, vB;
#define VMULEUB( vT, vA, vB )           vmuleub vT, vA, vB;
#define VMULEUH( vT, vA, vB )           vmuleuh vT, vA, vB;
#define VMULOSB( vT, vA, vB )           vmulosb vT, vA, vB;
#define VMULOSH( vT, vA, vB )           vmulosh vT, vA, vB;
#define VMULOUB( vT, vA, vB )           vmuloub vT, vA, vB;
#define VMULOUH( vT, vA, vB )           vmulouh vT, vA, vB;
#define VNMSUBFP( vT, vA, vC, vB )      vnmsubfp vT, vA, vC, vB;
#define VNOR( vT, vA, vB )              vnor vT, vA, vB;
#define VNOT( vT, vA )                  vnor vT, vA, vA;
#define VOR( vT, vA, vB )               vor vT, vA, vB;
#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )         vperm vT, vB, vA, vC;
#define VPKUHUM( vT, vA, vB )           vpkuhum vT, vB, vA;
#define VPKUHUS( vT, vA, vB )           vpkuhus vT, vB, vA;
#define VPKSHUS( vT, vA, vB )           vpkshus vT, vB, vA;
#define VPKSHSS( vT, vA, vB )           vpkshss vT, vB, vA;
#define VPKUWUM( vT, vA, vB )           vpkuwum vT, vB, vA;
#define VPKUWUS( vT, vA, vB )           vpkuwus vT, vB, vA;
#define VPKSWUS( vT, vA, vB )           vpkswus vT, vB, vA;
#define VPKSWSS( vT, vA, vB )           vpkswss vT, vB, vA;
#else
#define VPERM( vT, vA, vB, vC )         vperm vT, vA, vB, vC;
#define VPKUHUM( vT, vA, vB )           vpkuhum vT, vA, vB;
#define VPKUHUS( vT, vA, vB )           vpkuhus vT, vA, vB;
#define VPKSHUS( vT, vA, vB )           vpkshus vT, vA, vB;
#define VPKSHSS( vT, vA, vB )           vpkshss vT, vA, vB;
#define VPKUWUM( vT, vA, vB )           vpkuwum vT, vA, vB;
#define VPKUWUS( vT, vA, vB )           vpkuwus vT, vA, vB;
#define VPKSWUS( vT, vA, vB )           vpkswus vT, vA, vB;
#define VPKSWSS( vT, vA, vB )           vpkswss vT, vA, vB;
#endif
#define VREFP( vT, vB )                 vrefp vT, vB;
#define VRFIM( vT, vB )                 vrfim vT, vB;
#define VRFIN( vT, vB )                 vrfin vT, vB;
#define VRFIP( vT, vB )                 vrfip vT, vB;
#define VRFIZ( vT, vB )                 vrfiz vT, vB;
#define VRLB( vT, vA, vB )              vrlb vT, vA, vB;
#define VRLH( vT, vA, vB )              vrlh vT, vA, vB;
#define VRLW( vT, vA, vB )              vrlw vT, vA, vB;
#define VRSQRTEFP( vT, vB )             vrsqrtefp vT, vB;
#define VSEL( vT, vA, vB, vC )          vsel vT, vA, vB, vC;
#define VSL( vT, vA, vB )               vsl vT, vA, vB;
#if defined( LITTLE_ENDIAN )
#define VSLDOI( vT, vA, vB, UIMM )      vsldoi vT, vB, vA, (16 - (UIMM));
#else
#define VSLDOI( vT, vA, vB, UIMM )      vsldoi vT, vA, vB, (UIMM);
#endif
#define VSLB( vT, vA, vB )              vslb vT, vA, vB;
#define VSLH( vT, vA, vB )              vslh vT, vA, vB;
#define VSLO( vT, vA, vB )              vslo vT, vA, vB;
#define VSLW( vT, vA, vB )              vslw vT, vA, vB;
#define VSR( vT, vA, vB )               vsr vT, vA, vB;
#define VSRAB( vT, vA, vB )             vsrab vT, vA, vB;
#define VSRAH( vT, vA, vB )             vsrah vT, vA, vB;
#define VSRAW( vT, vA, vB )             vsraw vT, vA, vB;
#define VSRB( vT, vA, vB )              vsrb vT, vA, vB;
#define VSRH( vT, vA, vB )              vsrh vT, vA, vB;
#define VSRO( vT, vA, vB )              vsro vT, vA, vB;
#define VSRW( vT, vA, vB )              vsrw vT, vA, vB;
#define VSPLTB( vT, vB, UIMM )          vspltb vT, vB, C_INDEX_MUNGE( UIMM );
#define VSPLTH( vT, vB, UIMM )          vsplth vT, vB, S_INDEX_MUNGE( UIMM );
#define VSPLTW( vT, vB, UIMM )          vspltw vT, vB, L_INDEX_MUNGE( UIMM );
#define VSPLTISB( vT, SIMM )            vspltisb vT, (SIMM);
#define VSPLTISH( vT, SIMM )            vspltish vT, (SIMM);
#define VSPLTISW( vT, SIMM )            vspltisw vT, (SIMM);
#define VSUBFP( vT, vA, vB )            vsubfp vT, vA, vB;
#define VSUBSBS( vT, vA, vB )           vsubsbs vT, vA, vB;
#define VSUBSHS( vT, vA, vB )           vsubshs vT, vA, vB;
#define VSUBSWS( vT, vA, vB )           vsubsws vT, vA, vB;
#define VSUBUBM( vT, vA, vB )           vsububm vT, vA, vB;
#define VSUBUBS( vT, vA, vB )           vsububs vT, vA, vB;
#define VSUBUHM( vT, vA, vB )           vsubuhm vT, vA, vB;
#define VSUBUHS( vT, vA, vB )           vsubuhs vT, vA, vB;
#define VSUBUWM( vT, vA, vB )           vsubuwm vT, vA, vB;
#define VSUBUWS( vT, vA, vB )           vsubuws vT, vA, vB;
#define VSUMSWS( vT, vA, vB )           vsumsws vT, vA, vB;
#define VSUM2SWS( vT, vA, vB )          vsum2sws vT, vA, vB;
#define VSUM4SBS( vT, vA, vB )          vsum4sbs vT, vA, vB;
#define VSUM4SHS( vT, vA, vB )          vsum4shs vT, vA, vB;
#define VSUM4UBS( vT, vA, vB )          vsum4ubs vT, vA, vB;
#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB )               vupklsb vT, vB;
#define VUPKHSH( vT, vB )               vupklsh vT, vB;
#define VUPKLSB( vT, vB )               vupkhsb vT, vB;
#define VUPKLSH( vT, vB )               vupkhsh vT, vB;
#else
#define VUPKHSB( vT, vB )               vupkhsb vT, vB;
#define VUPKHSH( vT, vB )               vupkhsh vT, vB;
#define VUPKLSB( vT, vB )               vupklsb vT, vB;
#define VUPKLSH( vT, vB )               vupklsh vT, vB;
#endif
#define VXOR( vT, vA, vB )              vxor vT, vA, vB;

/*
 * stack and register macros
 */

#define VRSAVE_COND 7                   /* recommended VR condition bit */
#undef VOLATILE_r13                     /* r13 volatile or non-volatile */
#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)

#define ALIGN_STACK( nbytes ) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)
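ALIGN_STACK rounds a byte count up to the next multiple of MIN_STACK_ALIGN by adding the mask and clearing the low bits. The following C sketch reproduces that arithmetic with the listing's value of 16; the wrapper function `align_stack` is a hypothetical name added only so the macro can be exercised.

```c
#include <assert.h>

#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)
#define ALIGN_STACK(nbytes) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)

/* Hypothetical wrapper so the macro can be called like a function. */
static int align_stack(int nbytes)
{
    assert(nbytes >= 0);
    return ALIGN_STACK(nbytes);
}
```

Sizes that are already multiples of 16 are returned unchanged; anything else moves up to the next multiple, so 17 becomes 32.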

#define LR_SAVE_OFF 4
#define FPR_SAVE_OFF (-(32-14)*8)
#if defined( VOLATILE_r13 )
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-14)*4)
#else
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-13)*4)
#endif
#define CR_SAVE_OFF (GPR_SAVE_OFF - 4)
#if defined( BUILD_MAX )
#define VRSAVE_SAVE_OFF (CR_SAVE_OFF - 4)
#if defined( VOLATILE_r13 )
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 0)
#else
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 12)
#endif
#define VR_SAVE_OFF (ALIGNMENT_PADDING_OFF - (32-20)*16)
#define LAST_OFF VR_SAVE_OFF
#else
#define LAST_OFF CR_SAVE_OFF
#endif
#define REG_SAVE_SIZE (-LAST_OFF)
#define MAX_NARGS 18
#define ARGS_SIZE (MAX_NARGS * 4)
#define LINK_SIZE 8
#define STACK_FRAME_SIZE (REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE)
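To make the save-area layout concrete, the sketch below evaluates the offset definitions for one configuration. The configuration is an assumption (BUILD_MAX defined, VOLATILE_r13 not defined); it was chosen because it yields the figures 432 and 512 that appear in the comments later in this listing. The enumerators simply mirror the macro names.

```c
#include <assert.h>

/* Assumed configuration: BUILD_MAX defined, VOLATILE_r13 not defined. */
enum {
    FPR_SAVE_OFF          = -(32 - 14) * 8,                   /* f14..f31 */
    GPR_SAVE_OFF          = FPR_SAVE_OFF - (32 - 13) * 4,     /* r13..r31 */
    CR_SAVE_OFF           = GPR_SAVE_OFF - 4,
    VRSAVE_SAVE_OFF       = CR_SAVE_OFF - 4,
    ALIGNMENT_PADDING_OFF = VRSAVE_SAVE_OFF - 12,             /* pad to 16 */
    VR_SAVE_OFF           = ALIGNMENT_PADDING_OFF - (32 - 20) * 16, /* v20..v31 */
    LAST_OFF              = VR_SAVE_OFF,
    REG_SAVE_SIZE         = -LAST_OFF,
    MAX_NARGS             = 18,
    ARGS_SIZE             = MAX_NARGS * 4,
    LINK_SIZE             = 8,
    STACK_FRAME_SIZE      = REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE
};
```

Under these assumptions the register save area is 432 bytes and a frame with no local storage is 512 bytes, both multiples of the 16-byte stack alignment.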

/*
 * macros to obtain the byte offset into the stack for the last FPR
 * and GPR registers for small temporary storage.
 * FPR_SAVE_AREA_OFFSET points to an area of 8 * (# of unsaved non-volatile
 * FPR registers).
 * GPR_SAVE_AREA_OFFSET points to an area of 4 * (# of unsaved non-volatile
 * GPR registers).
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 *
 * For MAX only:
 *
 * VR_SAVE_AREA_OFFSET points to an area of 16 * (# of unsaved non-volatile
 * VR registers).
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */

#define FPR_SAVE_AREA_OFFSET FPR_SAVE_OFF
#define GPR_SAVE_AREA_OFFSET GPR_SAVE_OFF

#define GET_FPR_SAVE_AREA( ptr ) \
    addi ptr, sp, FPR_SAVE_AREA_OFFSET;

#define GET_GPR_SAVE_AREA( ptr ) \
    addi ptr, sp, GPR_SAVE_AREA_OFFSET;

#if defined( BUILD_MAX )
#define VR_SAVE_AREA_OFFSET VR_SAVE_OFF

#define GET_VR_SAVE_AREA( ptr ) \
    addi ptr, sp, VR_SAVE_AREA_OFFSET;
#endif

/* 

* if the function creates a stack frame with local storage, 

* LOCAL_STORAGE_OFFSET is the stack offset to the start of this

* storage and is guaranteed to have the minimum stack alignment. 
*/ 

#define LOCAL_STORAGE_OFFSET (LINK_SIZE + ARGS_SIZE) 
/*
 * macros to create and destroy a stack frame.
 *
 * CREATE_STACK_FRAME[_X] creates a stack frame that can handle up to
 * 18 GPR register arguments and a local storage size <=
 * 32768 - 512 = 32,256 bytes.
 *
 * CREATE_STACK_FRAME_X destroys r0.
 *
 * For CREATE_STACK_FRAME_X, local_nbytes_reg must not be r0.
 *
 * Both CREATE_STACK_FRAME[_X] and DESTROY_STACK_FRAME should not be
 * called before registers are saved or after they are restored.
 * The stack pointer "output from" CREATE_STACK_FRAME[_X] must be
 * the same "input to" DESTROY_STACK_FRAME.
 */

#define CREATE_STACK_FRAME( local_nbytes ) \
    stwu sp, -ALIGN_STACK( STACK_FRAME_SIZE + (local_nbytes) )(sp);

#define CREATE_STACK_FRAME_X( local_nbytes_reg ) \
    addi r0, local_nbytes_reg, (STACK_FRAME_SIZE + MIN_STACK_ALIGN_SIZE); \
    andi. r0, r0, ~MIN_STACK_ALIGN_MASK; \
    stwux sp, sp, r0;

#define DESTROY_STACK_FRAME \
    lwz sp, 0(sp);






/*
 * macros to allocate and free space on the user stack
 * with a fixed alignment of MIN_STACK_ALIGN.
 * nbytes must be <= (32768 - 432 = 32,336).
 * On return, sp points to a buffer of nbytes bytes.
 */

#define PUSH_STACK( nbytes ) \
    addi sp, sp, -ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define POP_STACK( nbytes ) \
    addi sp, sp, ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    mr ptr, sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )
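The limit on nbytes stated for PUSH_STACK follows from the fact that `addi` takes a signed 16-bit immediate, so the aligned displacement cannot be more negative than -32768. The C sketch below checks that bound. It assumes REG_SAVE_SIZE is 432, the value implied by the comment's "32768 - 432"; the helper names `push_stack_delta` and `fits_simm16` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_STACK_ALIGN_MASK 15
#define ALIGN_STACK(nbytes) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)
#define REG_SAVE_SIZE 432   /* assumed: the value implied by "32768 - 432" */

/* Signed displacement that PUSH_STACK would add to sp. */
static long push_stack_delta(long nbytes)
{
    assert(nbytes >= 0);
    return -ALIGN_STACK(REG_SAVE_SIZE + nbytes);
}

/* True if the value fits the signed 16-bit immediate field of addi. */
static bool fits_simm16(long value)
{
    return value >= -32768 && value <= 32767;
}
```

A request of exactly 32,336 bytes produces a displacement of -32768, the most negative value that still fits; one byte more rounds up to 32,784 and no longer does.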
/*
 * macros to create and destroy a stack buffer with a variable
 * alignment and size.
 *
 * CREATE_STACK_BUFFER[_X] creates a buffer of size nbytes and alignment
 * byte_align on the stack, returning a pointer to the buffer in the
 * GPR bufferp.
 * bufferp must be a GPR other than r0 and r1 (sp).
 * byte_align must be a power of 2 such that 2 <= byte_align <= 4096.
 * CREATE_STACK_BUFFER destroys r0.
 *
 * CREATE_STACK_BUFFER[_X] stores the original value of the stack pointer
 * below the buffer at offset 0 from the new stack pointer.
 *
 * DESTROY_STACK_BUFFER sets the stack pointer to the value stored
 * at the address pointed to by the input stack pointer.
 *
 * Both CREATE_STACK_BUFFER[_X] and DESTROY_STACK_BUFFER should not be
 * called before registers are saved or after they are restored.
 *
 * The stack pointer "output from" CREATE_STACK_BUFFER[_X] must be
 * the same "input to" DESTROY_STACK_BUFFER.
 */

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    addis bufferp, sp, (-(REG_SAVE_SIZE + (nbytes)) + 32768)@h; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, (-(REG_SAVE_SIZE + (nbytes)))@l; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;
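CREATE_STACK_BUFFER adds a 32-bit constant to sp in two steps: `addis` with `(x + 32768)@h` and `addi` with `x@l`. Because `addi` sign-extends its 16-bit immediate, the upper half has to be pre-adjusted by 0x8000 so the two pieces sum back to x. The C sketch below models that split; the function names are hypothetical and all arithmetic is modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Upper half used by addis: (x + 32768)@h, i.e. bits 16..31 of x + 0x8000. */
static uint16_t adjusted_high(uint32_t x)
{
    return (uint16_t)((x + 0x8000u) >> 16);
}

/* Lower half used by addi: x@l. */
static uint16_t low_half(uint32_t x)
{
    return (uint16_t)(x & 0xFFFFu);
}

/* addis followed by addi: add (hi << 16), then the sign-extended lo. */
static uint32_t addis_addi(uint32_t base, uint16_t hi, uint16_t lo)
{
    uint32_t lo_sext = (lo & 0x8000u) ? (0xFFFF0000u | lo) : lo;
    return base + ((uint32_t)hi << 16) + lo_sext;
}

/* Hypothetical wrapper: base + x computed through the two-instruction split. */
static uint32_t add_imm32(uint32_t base, int32_t x)
{
    uint32_t ux = (uint32_t)x;
    return addis_addi(base, adjusted_high(ux), low_half(ux));
}
```

The same adjustment is what the `@ha` operator performs in the LA, LBZA, LHAA, and similar address-forming macros of this listing.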



#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes_reg ) \
    sub bufferp, sp, nbytes_reg; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, -REG_SAVE_SIZE; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;
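The register arithmetic of CREATE_STACK_BUFFER_X can be followed step by step in C. The sketch below is only a model: addresses are plain 32-bit integers, the store performed by `stwux` is not simulated, and `reg_save_size` is passed as a parameter because its value depends on build options. The `stack_buffer` type and `create_stack_buffer` function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STACK_ALIGN 16u
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1u)

typedef struct {
    uint32_t bufferp;  /* aligned start of the buffer */
    uint32_t new_sp;   /* stack pointer after stwux */
} stack_buffer;

/* Mirrors the instruction sequence of CREATE_STACK_BUFFER_X. */
static stack_buffer create_stack_buffer(uint32_t sp, uint32_t byte_align,
                                        uint32_t nbytes, uint32_t reg_save_size)
{
    stack_buffer out;
    uint32_t mask = (byte_align - 1u) | MIN_STACK_ALIGN_MASK; /* li r0, ... */
    uint32_t p = sp - nbytes;      /* sub bufferp, sp, nbytes_reg */
    p -= reg_save_size;            /* addi bufferp, bufferp, -REG_SAVE_SIZE */
    p &= ~mask;                    /* andc bufferp, bufferp, r0 */
    uint32_t delta = p - sp;       /* sub r0, bufferp, sp */
    delta -= MIN_STACK_ALIGN;      /* addic r0, r0, -MIN_STACK_ALIGN */
    out.bufferp = p;
    out.new_sp = sp + delta;       /* stwux sp, sp, r0 (old sp saved at new sp) */
    return out;
}
```

The new stack pointer ends up exactly MIN_STACK_ALIGN bytes below the aligned buffer, which leaves room at offset 0 for the saved stack pointer that DESTROY_STACK_BUFFER later reloads.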



#define DESTROY_STACK_BUFFER \
    lwz sp, 0(sp);

/*
 * macros to create and destroy the sal cache buffer on the user stack.
 *
 * CREATE_STACK_SALCACHE destroys r0.
 * Both CREATE_STACK_SALCACHE and DESTROY_STACK_SALCACHE should not be
 * called before registers are saved or after they are restored.
 */

#define CREATE_STACK_SALCACHE( cachep ) \
    CREATE_STACK_BUFFER( cachep, SALCACHE_ALIGN, SALCACHE_ALLOC_SIZE )

#define DESTROY_STACK_SALCACHE DESTROY_STACK_BUFFER
/* 

* macros for saving and restoring non-volatile 

* floating point registers (FPRs) 
*/ 



#define SAVE_f14        SR_f14( stfd )
#define SAVE_f14_f15    SR_f14_f15( stfd )
#define SAVE_f14_f16    SR_f14_f16( stfd )
#define SAVE_f14_f17    SR_f14_f17( stfd )
#define SAVE_f14_f18    SR_f14_f18( stfd )
#define SAVE_f14_f19    SR_f14_f19( stfd )
#define SAVE_f14_f20    SR_f14_f20( stfd )
#define SAVE_f14_f21    SR_f14_f21( stfd )
#define SAVE_f14_f22    SR_f14_f22( stfd )
#define SAVE_f14_f23    SR_f14_f23( stfd )
#define SAVE_f14_f24    SR_f14_f24( stfd )
#define SAVE_f14_f25    SR_f14_f25( stfd )
#define SAVE_f14_f26    SR_f14_f26( stfd )

/*
 * macros common to both FPR save and restore
 */

#define SR_f14( opcode ) \
    opcode f14, (FPR_SAVE_OFF + 17*8)(sp);
#define SR_f14_f15( opcode ) \
    opcode f15, (FPR_SAVE_OFF + 16*8)(sp); \
    SR_f14( opcode )
#define SR_f14_f16( opcode ) \
    opcode f16, (FPR_SAVE_OFF + 15*8)(sp); \
    SR_f14_f15( opcode )
#define SR_f14_f17( opcode ) \
    opcode f17, (FPR_SAVE_OFF + 14*8)(sp); \
    SR_f14_f16( opcode )
#define SR_f14_f18( opcode ) \
    opcode f18, (FPR_SAVE_OFF + 13*8)(sp); \
    SR_f14_f17( opcode )
#define SR_f14_f19( opcode ) \
    opcode f19, (FPR_SAVE_OFF + 12*8)(sp); \
    SR_f14_f18( opcode )
#define SR_f14_f20( opcode ) \
    opcode f20, (FPR_SAVE_OFF + 11*8)(sp); \
    SR_f14_f19( opcode )
#define SR_f14_f21( opcode ) \
    opcode f21, (FPR_SAVE_OFF + 10*8)(sp); \
    SR_f14_f20( opcode )
#define SR_f14_f22( opcode ) \
    opcode f22, (FPR_SAVE_OFF + 9*8)(sp); \
    SR_f14_f21( opcode )
#define SR_f14_f23( opcode ) \
    opcode f23, (FPR_SAVE_OFF + 8*8)(sp); \
    SR_f14_f22( opcode )
#define SR_f14_f24( opcode ) \
    opcode f24, (FPR_SAVE_OFF + 7*8)(sp); \
    SR_f14_f23( opcode )
#define SR_f14_f25( opcode ) \
    opcode f25, (FPR_SAVE_OFF + 6*8)(sp); \
    SR_f14_f24( opcode )
#define SR_f14_f26( opcode ) \
    opcode f26, (FPR_SAVE_OFF + 5*8)(sp); \
    SR_f14_f25( opcode )
#define SR_f14_f27( opcode ) \
    opcode f27, (FPR_SAVE_OFF + 4*8)(sp); \
    SR_f14_f26( opcode )
#define SR_f14_f28( opcode ) \
    opcode f28, (FPR_SAVE_OFF + 3*8)(sp); \
    SR_f14_f27( opcode )
#define SR_f14_f29( opcode ) \
    opcode f29, (FPR_SAVE_OFF + 2*8)(sp); \
    SR_f14_f28( opcode )
#define SR_f14_f30( opcode ) \
    opcode f30, (FPR_SAVE_OFF + 1*8)(sp); \
    SR_f14_f29( opcode )
#define SR_f14_f31( opcode ) \
    opcode f31, (FPR_SAVE_OFF)(sp); \
    SR_f14_f30( opcode )
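Each SR_f14_fNN macro emits one instruction for its highest register and then delegates to the next-smaller range, so a single macro call expands into a run of stores or loads. The C sketch below imitates that chaining for the first three levels. The `record` function is a hypothetical stand-in for the `stfd` or `lfd` opcode, and the opcode parameter of the real macros is omitted for brevity.

```c
#include <assert.h>

#define FPR_SAVE_OFF (-(32 - 14) * 8)

/* Hypothetical recorder standing in for an stfd/lfd instruction. */
static int g_regs[18];
static int g_offsets[18];
static int g_count;

static void record(int fpr, int offset)
{
    g_regs[g_count] = fpr;
    g_offsets[g_count] = offset;
    g_count++;
}

/* Same shape as the SR_f14... macros: handle one register, then
 * expand the macro for the next-smaller register range. */
#define SR_f14()      record(14, FPR_SAVE_OFF + 17 * 8);
#define SR_f14_f15()  record(15, FPR_SAVE_OFF + 16 * 8); SR_f14()
#define SR_f14_f16()  record(16, FPR_SAVE_OFF + 15 * 8); SR_f14_f15()

/* Expands to three record() calls, for f16, f15, and f14 in that order. */
static int save_f14_f16(void)
{
    g_count = 0;
    SR_f14_f16()
    return g_count;
}
```

In this model f14 lands at offset -8 from sp, the slot closest to the top of the save area, and each higher-numbered register sits 8 bytes lower.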

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#if defined( VOLATILE_r13 )
#define SAVE_r13
#define SAVE_r13_r14    SR_r14( stw )
#define SAVE_r13_r15    SR_r14_r15( stw )
#define SAVE_r13_r16    SR_r14_r16( stw )
#define SAVE_r13_r17    SR_r14_r17( stw )
#define SAVE_r13_r18    SR_r14_r18( stw )
#define SAVE_r13_r19    SR_r14_r19( stw )
#define SAVE_r13_r20    SR_r14_r20( stw )

#define REST_r13_r26    SR_r13_r26( lwz )
#define REST_r13_r27    SR_r13_r27( lwz )
#define REST_r13_r28    SR_r13_r28( lwz )
#define REST_r13_r29    SR_r13_r29( lwz )
#define REST_r13_r30    SR_r13_r30( lwz )
#define REST_r13_r31    SR_r13_r31( lwz )

/*
 * macros common to both GPR save and restore
 */

#define SR_r13( opcode ) \
    opcode r13, (GPR_SAVE_OFF + 18*4)(sp);
#define SR_r13_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp); \
    SR_r13( opcode )
#define SR_r13_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r13_r14( opcode )
#define SR_r13_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r13_r15( opcode )
#define SR_r13_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r13_r16( opcode )
#define SR_r13_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r13_r17( opcode )
#define SR_r13_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r13_r18( opcode )
#define SR_r13_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r13_r19( opcode )
#define SR_r13_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r13_r20( opcode )
#define SR_r13_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r13_r21( opcode )
#define SR_r13_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r13_r22( opcode )
#define SR_r13_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r13_r23( opcode )
#define SR_r13_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r13_r24( opcode )
#define SR_r13_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r13_r25( opcode )
#define SR_r13_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r13_r26( opcode )
#define SR_r13_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r13_r27( opcode )
#define SR_r13_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r13_r28( opcode )
#define SR_r13_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r13_r29( opcode )
#define SR_r13_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r13_r30( opcode )
#endif /* end VOLATILE_r13 */



/*
 * macros common to both GPR save and restore
 */

#define SR_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp);
#define SR_r14_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r14( opcode )
#define SR_r14_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r14_r15( opcode )
#define SR_r14_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r14_r16( opcode )
#define SR_r14_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r14_r17( opcode )
#define SR_r14_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r14_r18( opcode )
#define SR_r14_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r14_r19( opcode )
#define SR_r14_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r14_r20( opcode )
#define SR_r14_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r14_r21( opcode )
#define SR_r14_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r14_r22( opcode )
#define SR_r14_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r14_r23( opcode )
#define SR_r14_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r14_r24( opcode )
#define SR_r14_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r14_r25( opcode )
#define SR_r14_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r14_r26( opcode )
#define SR_r14_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r14_r27( opcode )
#define SR_r14_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r14_r28( opcode )
#define SR_r14_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r14_r29( opcode )
#define SR_r14_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r14_r30( opcode )

/*
 * macros common to both GPR save and restore
 */

#define SR_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp);
#define SR_r15_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r15( opcode )
#define SR_r15_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r15_r16( opcode )
#define SR_r15_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r15_r17( opcode )
#define SR_r15_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r15_r18( opcode )
#define SR_r15_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r15_r19( opcode )
#define SR_r15_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r15_r20( opcode )
#define SR_r15_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r15_r21( opcode )
#define SR_r15_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r15_r22( opcode )
#define SR_r15_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r15_r23( opcode )
#define SR_r15_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r15_r24( opcode )
#define SR_r15_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r15_r25( opcode )
#define SR_r15_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r15_r26( opcode )
#define SR_r15_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r15_r27( opcode )
#define SR_r15_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r15_r28( opcode )
#define SR_r15_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r15_r29( opcode )
#define SR_r15_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r15_r30( opcode )
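
The offsets in the GPR macros follow a fixed layout: r31 occupies the slot at GPR_SAVE_OFF and each lower-numbered register sits one 4-byte word higher. As a minimal sketch of that arithmetic (the function names and the value of GPR_SAVE_OFF below are hypothetical and chosen only for illustration), the slot of any register can be computed as follows.

```c
#define GPR_SAVE_OFF 8   /* hypothetical base offset, for illustration only */

/* Byte offset from sp of the save slot for GPR r<reg>, following the
   listing's pattern: r31 at GPR_SAVE_OFF, r30 at +1*4, r29 at +2*4, ... */
static int gpr_save_offset(int reg)
{
    return GPR_SAVE_OFF + (31 - reg) * 4;
}

/* Number of bytes occupied when r<first_reg> through r31 are saved. */
static int gpr_save_bytes(int first_reg)
{
    return (31 - first_reg + 1) * 4;
}
```

With this layout the save area for any range ending at r31 is contiguous, so a range such as r15..r31 needs no gaps or per-range offset tables.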



#define SAVE_r16        SR_r16( stw )
#define SAVE_r16_r17    SR_r16_r17( stw )
#define SAVE_r16_r18    SR_r16_r18( stw )
#define SAVE_r16_r19    SR_r16_r19( stw )
#define SAVE_r16_r20    SR_r16_r20( stw )
#define SAVE_r16_r21    SR_r16_r21( stw )
#define SAVE_r16_r22    SR_r16_r22( stw )
#define SAVE_r16_r23    SR_r16_r23( stw )
#define SAVE_r16_r24    SR_r16_r24( stw )
#define SAVE_r16_r25    SR_r16_r25( stw )
#define SAVE_r16_r26    SR_r16_r26( stw )
#define SAVE_r16_r27    SR_r16_r27( stw )
#define SAVE_r16_r28    SR_r16_r28( stw )
#define SAVE_r16_r29    SR_r16_r29( stw )
#define SAVE_r16_r30    SR_r16_r30( stw )
#define SAVE_r16_r31    SR_r16_r31( stw )

#define REST_r16        SR_r16( lwz )
#define REST_r16_r17    SR_r16_r17( lwz )
#define REST_r16_r18    SR_r16_r18( lwz )
#define REST_r16_r19    SR_r16_r19( lwz )
#define REST_r16_r20    SR_r16_r20( lwz )
#define REST_r16_r21    SR_r16_r21( lwz )
#define REST_r16_r22    SR_r16_r22( lwz )
#define REST_r16_r23    SR_r16_r23( lwz )
#define REST_r16_r24    SR_r16_r24( lwz )
#define REST_r16_r25    SR_r16_r25( lwz )
#define REST_r16_r26    SR_r16_r26( lwz )
#define REST_r16_r27    SR_r16_r27( lwz )
#define REST_r16_r28    SR_r16_r28( lwz )
#define REST_r16_r29    SR_r16_r29( lwz )
#define REST_r16_r30    SR_r16_r30( lwz )
#define REST_r16_r31    SR_r16_r31( lwz )

/*
 * macros common to both GPR save and restore
 */

#define SR_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp);
#define SR_r16_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r16( opcode )
#define SR_r16_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r16_r17( opcode )
#define SR_r16_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r16_r18( opcode )
#define SR_r16_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r16_r19( opcode )
#define SR_r16_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r16_r20( opcode )
#define SR_r16_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r16_r21( opcode )
#define SR_r16_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r16_r22( opcode )
#define SR_r16_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r16_r23( opcode )
#define SR_r16_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r16_r24( opcode )
#define SR_r16_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r16_r25( opcode )
#define SR_r16_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r16_r26( opcode )
#define SR_r16_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r16_r27( opcode )
#define SR_r16_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r16_r28( opcode )
#define SR_r16_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r16_r29( opcode )
#define SR_r16_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r16_r30( opcode )

#if defined( BUILD_MAX )

/*
 * macros for saving and restoring non-volatile
 * vector registers (VRs)
 * (uses r0 as scratch register)
 */
#define SAVE_v20        SR_v20( stvx )
#define SAVE_v20_v21    SR_v20_v21( stvx )
#define SAVE_v20_v22    SR_v20_v22( stvx )
#define SAVE_v20_v23    SR_v20_v23( stvx )
#define SAVE_v20_v24    SR_v20_v24( stvx )
#define SAVE_v20_v25    SR_v20_v25( stvx )
#define SAVE_v20_v26    SR_v20_v26( stvx )
#define SAVE_v20_v27    SR_v20_v27( stvx )
#define SAVE_v20_v28    SR_v20_v28( stvx )
#define SAVE_v20_v29    SR_v20_v29( stvx )
#define SAVE_v20_v30    SR_v20_v30( stvx )
#define SAVE_v20_v31    SR_v20_v31( stvx )

#define REST_v20        SR_v20( lvx )
#define REST_v20_v21    SR_v20_v21( lvx )
#define REST_v20_v22    SR_v20_v22( lvx )
#define REST_v20_v23    SR_v20_v23( lvx )
#define REST_v20_v24    SR_v20_v24( lvx )
#define REST_v20_v25    SR_v20_v25( lvx )
#define REST_v20_v26    SR_v20_v26( lvx )
#define REST_v20_v27    SR_v20_v27( lvx )
#define REST_v20_v28    SR_v20_v28( lvx )
#define REST_v20_v29    SR_v20_v29( lvx )
#define REST_v20_v30    SR_v20_v30( lvx )
#define REST_v20_v31    SR_v20_v31( lvx )

/*
 * macros common to both VR save and restore
 * (uses r0 as scratch register)
 */

#define SR_v20( opcode ) \
    li r0, (VR_SAVE_OFF + 11*16); \
    opcode v20, sp, r0;
#define SR_v20_v21( opcode ) \
    li r0, (VR_SAVE_OFF + 10*16); \
    opcode v21, sp, r0; \
    SR_v20( opcode )
#define SR_v20_v22( opcode ) \
    li r0, (VR_SAVE_OFF + 9*16); \
    opcode v22, sp, r0; \
    SR_v20_v21( opcode )
#define SR_v20_v23( opcode ) \
    li r0, (VR_SAVE_OFF + 8*16); \
    opcode v23, sp, r0; \
    SR_v20_v22( opcode )
#define SR_v20_v24( opcode ) \
    li r0, (VR_SAVE_OFF + 7*16); \
    opcode v24, sp, r0; \
    SR_v20_v23( opcode )
#define SR_v20_v25( opcode ) \
    li r0, (VR_SAVE_OFF + 6*16); \
    opcode v25, sp, r0; \
    SR_v20_v24( opcode )
#define SR_v20_v26( opcode ) \
    li r0, (VR_SAVE_OFF + 5*16); \
    opcode v26, sp, r0; \
    SR_v20_v25( opcode )
#define SR_v20_v27( opcode ) \
    li r0, (VR_SAVE_OFF + 4*16); \
    opcode v27, sp, r0; \
    SR_v20_v26( opcode )
#define SR_v20_v28( opcode ) \
    li r0, (VR_SAVE_OFF + 3*16); \
    opcode v28, sp, r0; \
    SR_v20_v27( opcode )
#define SR_v20_v29( opcode ) \
    li r0, (VR_SAVE_OFF + 2*16); \
    opcode v29, sp, r0; \
    SR_v20_v28( opcode )
#define SR_v20_v30( opcode ) \
    li r0, (VR_SAVE_OFF + 1*16); \
    opcode v30, sp, r0; \
    SR_v20_v29( opcode )
#define SR_v20_v31( opcode ) \
    li r0, (VR_SAVE_OFF); \
    opcode v31, sp, r0; \
    SR_v20_v30( opcode )
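
The vector save slots are laid out like the GPR slots but with a 16-byte stride, and each access needs two instructions because lvx and stvx take their address as the sum of two registers rather than a register plus displacement. The sketch below models the slot arithmetic in C. It is illustrative only: the function names and the value of VR_SAVE_OFF are hypothetical, and the masking step reflects the AltiVec behavior that lvx and stvx ignore the low four bits of the effective address, so the sum of sp and VR_SAVE_OFF is assumed to be 16-byte aligned in practice.

```c
#include <stdint.h>

#define VR_SAVE_OFF 32u   /* hypothetical base offset, for illustration only */

/* Byte offset from sp of the save slot for vector register v<vreg>:
   v31 at VR_SAVE_OFF, each lower-numbered register 16 bytes higher. */
static uint32_t vr_save_offset(unsigned vreg)
{
    return VR_SAVE_OFF + (31u - vreg) * 16u;
}

/* Address actually accessed by lvx/stvx for a given sp and offset:
   the low four bits of the sum are dropped by the hardware. */
static uint32_t vx_effective_address(uint32_t sp, uint32_t offset)
{
    return (sp + offset) & ~(uint32_t)0xF;
}
```

If the stack pointer were only 8-byte aligned, two different nominal addresses could collapse onto the same 16-byte slot, which is why the alignment assumption matters for these macros.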

/*
 * macros for saving, updating and restoring VRSAVE and saving and
 * restoring non-volatile vector registers (v0 - v31)
 * (destroys r0 and CR0 field of CR)
 */
#define NON_VOLATILE_VR_TEST( last_vreg ) \
    andi. r0, r0, ((-1 << (31 - (last_vreg))) & 0x0fff);

#define RECORD_v0_v15( last_vreg ) \
    oris r0, r0, ((-1 << (15 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define RECORD_v16_v31( last_vreg ) \
    oris r0, r0, 0xffff; \
    ori r0, r0, ((-1 << (31 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define USE_v0_v15( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v0_v15( last_vreg )

#define USE_v16_v19( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v16_v31( last_vreg )

#define FREE_v0_v19( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;
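
The mask expressions in NON_VOLATILE_VR_TEST and the RECORD macros rely on VRSAVE holding one bit per vector register, with v0 in the most significant bit and v31 in the least significant bit. The following C model is a sketch for checking those expressions on a host machine, not part of the listing. The function names are hypothetical, and an unsigned all-ones constant replaces the assembler's `-1` because left-shifting a negative value is undefined in C.

```c
#include <stdint.h>

/* oris: OR a 16-bit immediate into the upper half of a 32-bit register. */
static uint32_t oris(uint32_t r, uint32_t imm16)
{
    return r | ((imm16 & 0xFFFFu) << 16);
}

/* ori: OR a 16-bit immediate into the lower half of a 32-bit register. */
static uint32_t ori(uint32_t r, uint32_t imm16)
{
    return r | (imm16 & 0xFFFFu);
}

/* Model of RECORD_v0_v15: mark v0..v<last_vreg> as in use (last_vreg <= 15). */
static uint32_t record_v0_v15(uint32_t vrsave, unsigned last_vreg)
{
    return oris(vrsave, (0xFFFFFFFFu << (15u - last_vreg)) & 0xFFFFu);
}

/* Model of RECORD_v16_v31: mark v0..v<last_vreg> as in use (16 <= last_vreg <= 31). */
static uint32_t record_v16_v31(uint32_t vrsave, unsigned last_vreg)
{
    uint32_t r = oris(vrsave, 0xFFFFu);
    return ori(r, (0xFFFFFFFFu << (31u - last_vreg)) & 0xFFFFu);
}

/* Model of NON_VOLATILE_VR_TEST: bits of v20..v<last_vreg> that the caller
   already marked in VRSAVE (20 <= last_vreg <= 31); zero means nothing to save. */
static uint32_t non_volatile_vr_test(uint32_t vrsave, unsigned last_vreg)
{
    return vrsave & ((0xFFFFFFFFu << (31u - last_vreg)) & 0x0FFFu);
}
```

For example, a caller whose VRSAVE marks only v0 through v19 yields a zero test result for any last_vreg, which corresponds to the branch that skips the vector saves.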

/*
 * user-callable macros
 */
#define USE_THRU_v0( cond )     USE_v0_v15( cond, 0 )
#define USE_THRU_v1( cond )     USE_v0_v15( cond, 1 )
#define USE_THRU_v2( cond )     USE_v0_v15( cond, 2 )
#define USE_THRU_v3( cond )     USE_v0_v15( cond, 3 )
#define USE_THRU_v4( cond )     USE_v0_v15( cond, 4 )
#define USE_THRU_v5( cond )     USE_v0_v15( cond, 5 )
#define USE_THRU_v6( cond )     USE_v0_v15( cond, 6 )
#define USE_THRU_v7( cond )     USE_v0_v15( cond, 7 )
#define USE_THRU_v8( cond )     USE_v0_v15( cond, 8 )
#define USE_THRU_v9( cond )     USE_v0_v15( cond, 9 )
#define USE_THRU_v10( cond )    USE_v0_v15( cond, 10 )
#define USE_THRU_v11( cond )    USE_v0_v15( cond, 11 )
#define USE_THRU_v12( cond )    USE_v0_v15( cond, 12 )
#define USE_THRU_v13( cond )    USE_v0_v15( cond, 13 )
#define USE_THRU_v14( cond )    USE_v0_v15( cond, 14 )
#define USE_THRU_v15( cond )    USE_v0_v15( cond, 15 )
#define USE_THRU_v16( cond )    USE_v16_v19( cond, 16 )
#define USE_THRU_v17( cond )    USE_v16_v19( cond, 17 )
#define USE_THRU_v18( cond )    USE_v16_v19( cond, 18 )
#define USE_THRU_v19( cond )    USE_v16_v19( cond, 19 )

#define USE_THRU_v20( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 32 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 20 )      /* v20 in use? */ \
    beq PC_OFFSET( 16 );            /* no, cond is set to greater than */ \
    SAVE_v20                        /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 20 )            /* indicate v0 - v20 in use */

#define USE_THRU_v21( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 40 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 21 )      /* v20 - v21 in use? */ \
    beq PC_OFFSET( 24 );            /* no, cond is set to greater than */ \
    SAVE_v20_v21                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 21 )            /* indicate v0 - v21 in use */

#define USE_THRU_v22( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 48 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 22 )      /* v20 - v22 in use? */ \
    beq PC_OFFSET( 32 );            /* no, cond is set to greater than */ \
    SAVE_v20_v22                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 22 )            /* indicate v0 - v22 in use */

#define USE_THRU_v23( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 56 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 23 )      /* v20 - v23 in use? */ \
    beq PC_OFFSET( 40 );            /* no, cond is set to greater than */ \
    SAVE_v20_v23                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 23 )            /* indicate v0 - v23 in use */

#define USE_THRU_v24( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 64 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 24 )      /* v20 - v24 in use? */ \
    beq PC_OFFSET( 48 );            /* no, cond is set to greater than */ \
    SAVE_v20_v24                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 24 )            /* indicate v0 - v24 in use */

#define USE_THRU_v25( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 72 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 25 )      /* v20 - v25 in use? */ \
    beq PC_OFFSET( 56 );            /* no, cond is set to greater than */ \
    SAVE_v20_v25                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 25 )            /* indicate v0 - v25 in use */

#define USE_THRU_v26( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 80 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 26 )      /* v20 - v26 in use? */ \
    beq PC_OFFSET( 64 );            /* no, cond is set to greater than */ \
    SAVE_v20_v26                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 26 )            /* indicate v0 - v26 in use */

#define USE_THRU_v27( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 88 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 27 )      /* v20 - v27 in use? */ \
    beq PC_OFFSET( 72 );            /* no, cond is set to greater than */ \
    SAVE_v20_v27                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 27 )            /* indicate v0 - v27 in use */

#define USE_THRU_v28( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 96 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 28 )      /* v20 - v28 in use? */ \
    beq PC_OFFSET( 80 );            /* no, cond is set to greater than */ \
    SAVE_v20_v28                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 28 )            /* indicate v0 - v28 in use */

#define USE_THRU_v29( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 104 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 29 )      /* v20 - v29 in use? */ \
    beq PC_OFFSET( 88 );            /* no, cond is set to greater than */ \
    SAVE_v20_v29                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 29 )            /* indicate v0 - v29 in use */

#define USE_THRU_v30( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 112 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 30 )      /* v20 - v30 in use? */ \
    beq PC_OFFSET( 96 );            /* no, cond is set to greater than */ \
    SAVE_v20_v30                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 30 )            /* indicate v0 - v30 in use */

#define USE_THRU_v31( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 120 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 31 )      /* v20 - v31 in use? */ \
    beq PC_OFFSET( 104 );           /* no, cond is set to greater than */ \
    SAVE_v20_v31                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 31 )            /* indicate v0 - v31 in use */

#define FREE_THRU_v17( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v18( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v19( cond )   FREE_v0_v19( cond )

#define FREE_THRU_v20( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 20 ); \
    bgt (cond), PC_OFFSET( 12 ); \
    REST_v20; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v21( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 28 ); \
    bgt (cond), PC_OFFSET( 20 ); \
    REST_v20_v21; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v22( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 36 ); \
    bgt (cond), PC_OFFSET( 28 ); \
    REST_v20_v22; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v23( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 44 ); \
    bgt (cond), PC_OFFSET( 36 ); \
    REST_v20_v23; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v24( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 52 ); \
    bgt (cond), PC_OFFSET( 44 ); \
    REST_v20_v24; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v25( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 60 ); \
    bgt (cond), PC_OFFSET( 52 ); \
    REST_v20_v25; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v26( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 68 ); \
    bgt (cond), PC_OFFSET( 60 ); \
    REST_v20_v26; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v27( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 76 ); \
    bgt (cond), PC_OFFSET( 68 ); \
    REST_v20_v27; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v28( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 84 ); \
    bgt (cond), PC_OFFSET( 76 ); \
    REST_v20_v28; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v29( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 92 ); \
    bgt (cond), PC_OFFSET( 84 ); \
    REST_v20_v29; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v30( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 100 ); \
    bgt (cond), PC_OFFSET( 92 ); \
    REST_v20_v30; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v31( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 108 ); \
    bgt (cond), PC_OFFSET( 100 ); \
    REST_v20_v31; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#endif /* end BUILD_MAX */
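
The branch displacements in the USE_THRU and FREE_THRU macros for v20 through v31 grow by 8 for each additional register, which matches the two 4-byte instructions (li plus stvx or lvx) that each vector save or restore expands to. The C sketch below captures that pattern as closed-form expressions inferred from the values in the listing. The function names are hypothetical, and it assumes PC_OFFSET( n ) denotes a branch target n bytes beyond the branch instruction, which the listing does not define in this section.

```c
/* Displacement of the first branch in USE_THRU_v<n>: taken when VRSAVE is 0,
   skipping the rest of the macro. Valid for 20 <= last_vreg <= 31. */
static int use_thru_skip_all(int last_vreg)
{
    return 32 + 8 * (last_vreg - 20);
}

/* Displacement of the second branch in USE_THRU_v<n>: taken when none of
   v20..v<n> is marked in use, skipping the vector saves. */
static int use_thru_skip_save(int last_vreg)
{
    return 16 + 8 * (last_vreg - 20);
}

/* Displacement of the beq in FREE_THRU_v<n>. */
static int free_thru_skip_all(int last_vreg)
{
    return 20 + 8 * (last_vreg - 20);
}

/* Displacement of the bgt in FREE_THRU_v<n>: skips the vector restores. */
static int free_thru_skip_rest(int last_vreg)
{
    return 12 + 8 * (last_vreg - 20);
}
```

Checking a hand-written displacement against these expressions is a quick way to catch a transcription error in any one of the twenty-four macros.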



/*
 * macros to save and restore the CR register
 * (uses r0 as scratch register)
 */
#define SAVE_CR \
    mfcr r0; \
    stw r0, CR_SAVE_OFF(sp);

#define REST_CR \
    lwz r0, CR_SAVE_OFF(sp); \
    mtcr r0;

/*
 * macros to save and restore the LR register
 * (uses r0 as scratch register)
 */
#define SAVE_LR \
    mflr r0; \
    stw r0, LR_SAVE_OFF(sp);

#define REST_LR \
    lwz r0, LR_SAVE_OFF(sp); \
    mtlr r0;

#endif /* end COMPILE_C */

/*
 * macros for declaring GPR, FPR and VMX registers
 */

/*
 * declare r0
 */
#define DECLARE_r0

/*
 * r3 declare set
 */
#define DECLARE_r3
#define DECLARE_r3_r4
#define DECLARE_r3_r5
#define DECLARE_r3_r6
#define DECLARE_r3_r7
#define DECLARE_r3_r8
#define DECLARE_r3_r9
#define DECLARE_r3_r10
#define DECLARE_r3_r11
#define DECLARE_r3_r12
#define DECLARE_r3_r13
#define DECLARE_r3_r14
#define DECLARE_r3_r15
#define DECLARE_r3_r16
#define DECLARE_r3_r17
#define DECLARE_r3_r18
#define DECLARE_r3_r19
#define DECLARE_r3_r20
#define DECLARE_r3_r21
#define DECLARE_r3_r22
#define DECLARE_r3_r23
#define DECLARE_r3_r24
#define DECLARE_r3_r25
#define DECLARE_r3_r26
#define DECLARE_r3_r27
#define DECLARE_r3_r28
#define DECLARE_r3_r29
#define DECLARE_r3_r30
#define DECLARE_r3_r31

/*
 * r4 declare set
 */
#define DECLARE_r4
#define DECLARE_r4_r5
#define DECLARE_r4_r6
#define DECLARE_r4_r7
#define DECLARE_r4_r8
#define DECLARE_r4_r9
#define DECLARE_r4_r10
#define DECLARE_r4_r11
#define DECLARE_r4_r12
#define DECLARE_r4_r13
#define DECLARE_r4_r14
#define DECLARE_r4_r15
#define DECLARE_r4_r16
#define DECLARE_r4_r17
#define DECLARE_r4_r18
#define DECLARE_r4_r19
#define DECLARE_r4_r20
#define DECLARE_r4_r21
#define DECLARE_r4_r22
#define DECLARE_r4_r23
#define DECLARE_r4_r24
#define DECLARE_r4_r25
#define DECLARE_r4_r26
#define DECLARE_r4_r27
#define DECLARE_r4_r28
#define DECLARE_r4_r29
#define DECLARE_r4_r30
#define DECLARE_r4_r31

/*
 * r5 declare set
 */
#define DECLARE_r5
#define DECLARE_r5_r6
#define DECLARE_r5_r7
#define DECLARE_r5_r8
#define DECLARE_r5_r9
#define DECLARE_r5_r10
#define DECLARE_r5_r11
#define DECLARE_r5_r12
#define DECLARE_r5_r13
#define DECLARE_r5_r14
#define DECLARE_r5_r15
#define DECLARE_r5_r16
#define DECLARE_r5_r17
#define DECLARE_r5_r18
#define DECLARE_r5_r19
#define DECLARE_r5_r20
#define DECLARE_r5_r21
#define DECLARE_r5_r22
#define DECLARE_r5_r23
#define DECLARE_r5_r24
#define DECLARE_r5_r25
#define DECLARE_r5_r26
#define DECLARE_r5_r27
#define DECLARE_r5_r28
#define DECLARE_r5_r29
#define DECLARE_r5_r30
#define DECLARE_r5_r31



/*
 * r6 declare set
 */
#define DECLARE_r6
#define DECLARE_r6_r7
#define DECLARE_r6_r8
#define DECLARE_r6_r9
#define DECLARE_r6_r10
#define DECLARE_r6_r11
#define DECLARE_r6_r12
#define DECLARE_r6_r13
#define DECLARE_r6_r14
#define DECLARE_r6_r15
#define DECLARE_r6_r16
#define DECLARE_r6_r17
#define DECLARE_r6_r18
#define DECLARE_r6_r19
#define DECLARE_r6_r20
#define DECLARE_r6_r21
#define DECLARE_r6_r22
#define DECLARE_r6_r23
#define DECLARE_r6_r24
#define DECLARE_r6_r25
#define DECLARE_r6_r26
#define DECLARE_r6_r27
#define DECLARE_r6_r28
#define DECLARE_r6_r29
#define DECLARE_r6_r30
#define DECLARE_r6_r31

/*
 * r7 declare set
 */
#define DECLARE_r7
#define DECLARE_r7_r8
#define DECLARE_r7_r9
#define DECLARE_r7_r10
#define DECLARE_r7_r11
#define DECLARE_r7_r12
#define DECLARE_r7_r13
#define DECLARE_r7_r14
#define DECLARE_r7_r15
#define DECLARE_r7_r16
#define DECLARE_r7_r17
#define DECLARE_r7_r18
#define DECLARE_r7_r19
#define DECLARE_r7_r20
#define DECLARE_r7_r21
#define DECLARE_r7_r22
#define DECLARE_r7_r23
#define DECLARE_r7_r24
#define DECLARE_r7_r25
#define DECLARE_r7_r26
#define DECLARE_r7_r27
#define DECLARE_r7_r28
#define DECLARE_r7_r29
#define DECLARE_r7_r30
#define DECLARE_r7_r31



/*
 * r8 declare set
 */
#define DECLARE_r8
#define DECLARE_r8_r9
#define DECLARE_r8_r10
#define DECLARE_r8_r11
#define DECLARE_r8_r12
#define DECLARE_r8_r13
#define DECLARE_r8_r14
#define DECLARE_r8_r15
#define DECLARE_r8_r16
#define DECLARE_r8_r17
#define DECLARE_r8_r18
#define DECLARE_r8_r19
#define DECLARE_r8_r20
#define DECLARE_r8_r21
#define DECLARE_r8_r22
#define DECLARE_r8_r23
#define DECLARE_r8_r24
#define DECLARE_r8_r25
#define DECLARE_r8_r26
#define DECLARE_r8_r27
#define DECLARE_r8_r28
#define DECLARE_r8_r29
#define DECLARE_r8_r30
#define DECLARE_r8_r31

/*
 * r9 declare set
 */
#define DECLARE_r9
#define DECLARE_r9_r10
#define DECLARE_r9_r11
#define DECLARE_r9_r12
#define DECLARE_r9_r13
#define DECLARE_r9_r14
#define DECLARE_r9_r15
#define DECLARE_r9_r16
#define DECLARE_r9_r17
#define DECLARE_r9_r18
#define DECLARE_r9_r19
#define DECLARE_r9_r20
#define DECLARE_r9_r21
#define DECLARE_r9_r22
#define DECLARE_r9_r23
#define DECLARE_r9_r24
#define DECLARE_r9_r25
#define DECLARE_r9_r26
#define DECLARE_r9_r27
#define DECLARE_r9_r28
#define DECLARE_r9_r29
#define DECLARE_r9_r30
#define DECLARE_r9_r31



/*
 * r10 declare set
 */
#define DECLARE_r10
#define DECLARE_r10_r11
#define DECLARE_r10_r12
#define DECLARE_r10_r13
#define DECLARE_r10_r14
#define DECLARE_r10_r15
#define DECLARE_r10_r16
#define DECLARE_r10_r17
#define DECLARE_r10_r18
#define DECLARE_r10_r19
#define DECLARE_r10_r20
#define DECLARE_r10_r21
#define DECLARE_r10_r22
#define DECLARE_r10_r23
#define DECLARE_r10_r24
#define DECLARE_r10_r25
#define DECLARE_r10_r26
#define DECLARE_r10_r27
#define DECLARE_r10_r28
#define DECLARE_r10_r29
#define DECLARE_r10_r30
#define DECLARE_r10_r31

/*
 * r11 declare set
 */
#define DECLARE_r11
#define DECLARE_r11_r12
#define DECLARE_r11_r13
#define DECLARE_r11_r14
#define DECLARE_r11_r15
#define DECLARE_r11_r16
#define DECLARE_r11_r17
#define DECLARE_r11_r18
#define DECLARE_r11_r19
#define DECLARE_r11_r20
#define DECLARE_r11_r21
#define DECLARE_r11_r22
#define DECLARE_r11_r23
#define DECLARE_r11_r24
#define DECLARE_r11_r25
#define DECLARE_r11_r26
#define DECLARE_r11_r27
#define DECLARE_r11_r28
#define DECLARE_r11_r29
#define DECLARE_r11_r30
#define DECLARE_r11_r31

/*
 * r12 declare set
 */
#define DECLARE_r12
#define DECLARE_r12_r13
#define DECLARE_r12_r14
#define DECLARE_r12_r15
#define DECLARE_r12_r16
#define DECLARE_r12_r17
#define DECLARE_r12_r18
#define DECLARE_r12_r19
#define DECLARE_r12_r20
#define DECLARE_r12_r21
#define DECLARE_r12_r22
#define DECLARE_r12_r23
#define DECLARE_r12_r24
#define DECLARE_r12_r25
#define DECLARE_r12_r26
#define DECLARE_r12_r27
#define DECLARE_r12_r28
#define DECLARE_r12_r29
#define DECLARE_r12_r30
#define DECLARE_r12_r31



/*
 * r13 declare set
 */
#define DECLARE_r13
#define DECLARE_r13_r14
#define DECLARE_r13_r15
#define DECLARE_r13_r16
#define DECLARE_r13_r17
#define DECLARE_r13_r18
#define DECLARE_r13_r19
#define DECLARE_r13_r20
#define DECLARE_r13_r21
#define DECLARE_r13_r22
#define DECLARE_r13_r23
#define DECLARE_r13_r24
#define DECLARE_r13_r25
#define DECLARE_r13_r26
#define DECLARE_r13_r27
#define DECLARE_r13_r28
#define DECLARE_r13_r29
#define DECLARE_r13_r30
#define DECLARE_r13_r31

/*
 * r14 declare set
 */
#define DECLARE_r14
#define DECLARE_r14_r15
#define DECLARE_r14_r16
#define DECLARE_r14_r17
#define DECLARE_r14_r18
#define DECLARE_r14_r19
#define DECLARE_r14_r20
#define DECLARE_r14_r21
#define DECLARE_r14_r22
#define DECLARE_r14_r23
#define DECLARE_r14_r24
#define DECLARE_r14_r25
#define DECLARE_r14_r26
#define DECLARE_r14_r27
#define DECLARE_r14_r28
#define DECLARE_r14_r29
#define DECLARE_r14_r30
#define DECLARE_r14_r31

/*
 * r15 declare set
 */
#define DECLARE_r15
#define DECLARE_r15_r16
#define DECLARE_r15_r17
#define DECLARE_r15_r18
#define DECLARE_r15_r19
#define DECLARE_r15_r20
#define DECLARE_r15_r21
#define DECLARE_r15_r22
#define DECLARE_r15_r23
#define DECLARE_r15_r24
#define DECLARE_r15_r25
#define DECLARE_r15_r26
#define DECLARE_r15_r27
#define DECLARE_r15_r28
#define DECLARE_r15_r29
#define DECLARE_r15_r30
#define DECLARE_r15_r31



/*
 * r16 declare set
 */
#define DECLARE_r16
#define DECLARE_r16_r17
#define DECLARE_r16_r18
#define DECLARE_r16_r19
#define DECLARE_r16_r20
#define DECLARE_r16_r21
#define DECLARE_r16_r22
#define DECLARE_r16_r23
#define DECLARE_r16_r24
#define DECLARE_r16_r25
#define DECLARE_r16_r26
#define DECLARE_r16_r27
#define DECLARE_r16_r28
#define DECLARE_r16_r29
#define DECLARE_r16_r30
#define DECLARE_r16_r31

/*
 * r17 declare set
 */
#define DECLARE_r17
#define DECLARE_r17_r18
#define DECLARE_r17_r19
#define DECLARE_r17_r20
#define DECLARE_r17_r21
#define DECLARE_r17_r22
#define DECLARE_r17_r23
#define DECLARE_r17_r24
#define DECLARE_r17_r25
#define DECLARE_r17_r26
#define DECLARE_r17_r27
#define DECLARE_r17_r28
#define DECLARE_r17_r29
#define DECLARE_r17_r30
#define DECLARE_r17_r31

/*
 * r18 declare set
 */
#define DECLARE_r18
#define DECLARE_r18_r19
#define DECLARE_r18_r20
#define DECLARE_r18_r21
#define DECLARE_r18_r22
#define DECLARE_r18_r23
#define DECLARE_r18_r24
#define DECLARE_r18_r25
#define DECLARE_r18_r26
#define DECLARE_r18_r27
#define DECLARE_r18_r28
#define DECLARE_r18_r29
#define DECLARE_r18_r30
#define DECLARE_r18_r31

/*
 * r19 declare set
 */
#define DECLARE_r19
#define DECLARE_r19_r20
#define DECLARE_r19_r21
#define DECLARE_r19_r22
#define DECLARE_r19_r23
#define DECLARE_r19_r24
#define DECLARE_r19_r25
#define DECLARE_r19_r26
#define DECLARE_r19_r27
#define DECLARE_r19_r28
#define DECLARE_r19_r29
#define DECLARE_r19_r30
#define DECLARE_r19_r31

/*
 * FPR single precision declare set
 */
#define DECLARE_f0
#define DECLARE_f0_f1
#define DECLARE_f0_f2
#define DECLARE_f0_f3
#define DECLARE_f0_f4
#define DECLARE_f0_f5
#define DECLARE_f0_f6
#define DECLARE_f0_f7
#define DECLARE_f0_f8
#define DECLARE_f0_f9
#define DECLARE_f0_f10
#define DECLARE_f0_f11
#define DECLARE_f0_f12
#define DECLARE_f0_f13
#define DECLARE_f0_f14
#define DECLARE_f0_f15
#define DECLARE_f0_f16
#define DECLARE_f0_f17
#define DECLARE_f0_f18
#define DECLARE_f0_f19
#define DECLARE_f0_f20
#define DECLARE_f0_f21
#define DECLARE_f0_f22
#define DECLARE_f0_f23
#define DECLARE_f0_f24
#define DECLARE_f0_f25
#define DECLARE_f0_f26
#define DECLARE_f0_f27
#define DECLARE_f0_f28
#define DECLARE_f0_f29
#define DECLARE_f0_f30
#define DECLARE_f0_f31

/*
 * FPR double precision declare set
 */
#define DECLARE_d0
#define DECLARE_d0_d1
#define DECLARE_d0_d2
#define DECLARE_d0_d3
#define DECLARE_d0_d4
#define DECLARE_d0_d5
#define DECLARE_d0_d6
#define DECLARE_d0_d7
#define DECLARE_d0_d8
#define DECLARE_d0_d9
#define DECLARE_d0_d10
#define DECLARE_d0_d11
#define DECLARE_d0_d12
#define DECLARE_d0_d13
#define DECLARE_d0_d14
#define DECLARE_d0_d15
#define DECLARE_d0_d16
#define DECLARE_d0_d17
#define DECLARE_d0_d18
#define DECLARE_d0_d19
#define DECLARE_d0_d20
#define DECLARE_d0_d21
#define DECLARE_d0_d22
#define DECLARE_d0_d23
#define DECLARE_d0_d24
#define DECLARE_d0_d25
#define DECLARE_d0_d26
#define DECLARE_d0_d27
#define DECLARE_d0_d28
#define DECLARE_d0_d29
#define DECLARE_d0_d30
#define DECLARE_d0_d31

/*
 * VMX declare set
 */

#define DECLARE_v0
#define DECLARE_v0_v1
#define DECLARE_v0_v2
#define DECLARE_v0_v3
#define DECLARE_v0_v4
#define DECLARE_v0_v5
#define DECLARE_v0_v6
#define DECLARE_v0_v7
#define DECLARE_v0_v8
#define DECLARE_v0_v9
#define DECLARE_v0_v10
#define DECLARE_v0_v11
#define DECLARE_v0_v12
#define DECLARE_v0_v13
#define DECLARE_v0_v14
#define DECLARE_v0_v15
#define DECLARE_v0_v16
#define DECLARE_v0_v17
#define DECLARE_v0_v18
#define DECLARE_v0_v19
#define DECLARE_v0_v20
#define DECLARE_v0_v21
#define DECLARE_v0_v22
#define DECLARE_v0_v23
#define DECLARE_v0_v24
#define DECLARE_v0_v25
#define DECLARE_v0_v26
#define DECLARE_v0_v27
#define DECLARE_v0_v28
#define DECLARE_v0_v29
#define DECLARE_v0_v30
#define DECLARE_v0_v31

#endif /* end SALPPC_INC */

/*
 *
 *
 * END OF FILE salppc.inc
 */
/*
    MC Standard Algorithms -- PPC Macro language Version

    File Name:    SVE3_8BIT.MAC

    Description:  Sum the elements of 3 signed byte vectors
                  each of length N.

                  sve3_8bit ( char *A, char *B, char *C, long *SUM, int N )

    Restrictions: A, B and C must all be 16-byte aligned.
                  N must be a multiple of 16 and >= 16.

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision  Date    Engineer  Reason
    0.0       000605  fpl       Created
*/

#include "salppc.inc"
/**
    Input parameters
**/
#define A       r3
#define B       r4
#define C       r5
#define SUM     r6
#define N       r7

#define A0p     A
#define B0p     B
#define C0p     C
#define A1p     r8
#define B1p     r9
#define C1p     r10
#define index   r11

#define zero    v0
#define one     v1
#define a0      v2
#define a1      v3
#define b0      v4
#define b1      v5
#define c0      v6
#define c1      v7
#define sum0    v8
#define sum1    v9
#define sum2    v10

FUNC_PROLOG
ENTRY_5( sve3_8bit, A, B, C, SUM, N )
USE_THRU_v10( VRSAVE_COND )

    LI( index, 0 )
    VXOR( zero, zero, zero )
    ADDIC_C( N, N, -32 )
    LVX( a0, A0p, index )
    VSPLTISB( one, 1 )
    LVX( b0, B0p, index )
    ADDI( A1p, A0p, 16 )
    VXOR( sum0, sum0, sum0 )
    ADDI( B1p, B0p, 16 )
    VXOR( sum1, sum1, sum1 )
    ADDI( C1p, C0p, 16 )
    VXOR( sum2, sum2, sum2 )
    BLT( do16 )

LABEL( loop )
    ADDIC_C( N, -32 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    LVX( a1, A1p, index )
    VMSUMMBM( sum1, b0, one, sum1 )
    LVX( b1, B1p, index )
    VMSUMMBM( sum2, c0, one, sum2 )
    LVX( c1, C1p, index )
    ADDI( index, index, 32 )
    VMSUMMBM( sum0, a1, one, sum0 )
    LVX( a0, A0p, index )
    VMSUMMBM( sum1, b1, one, sum1 )
    LVX( b0, B0p, index )
    VMSUMMBM( sum2, c1, one, sum2 )
    BGE( loop )

    CMPWI( N, -32 )
    BEQ( combine )

LABEL( do16 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    VMSUMMBM( sum1, b0, one, sum1 )
    VMSUMMBM( sum2, c0, one, sum2 )

LABEL( combine )
    VADDUWM( sum0, sum0, sum1 )
    VADDUWM( sum0, sum0, sum2 )
    VSUMSWS( sum0, sum0, zero )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, SUM )

FREE_THRU_v10( VRSAVE_COND )
RETURN
FUNC_EPILOG
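
For checking the vector routine above, a scalar reference can be useful. The following is a minimal sketch in C, not part of the original listing: the name sve3_8bit_ref is hypothetical, and it assumes the total fits in a long (it does not model the modulo and saturating arithmetic of the vector instructions).

```c
#include <assert.h>

/* Scalar reference for sve3_8bit (hypothetical helper, for illustration).
 * Adds every element of the three signed-byte vectors A, B and C, each of
 * length N, and writes the single total to *SUM. */
static void sve3_8bit_ref(const signed char *A, const signed char *B,
                          const signed char *C, long *SUM, int N)
{
    long total = 0;

    assert(N >= 0);
    for (int i = 0; i < N; ++i) {
        /* One partial sum per input vector in the assembly (sum0, sum1,
         * sum2); a single accumulator gives the same total here. */
        total += A[i];
        total += B[i];
        total += C[i];
    }
    *SUM = total;
}
```

The reference ignores the alignment and multiple-of-16 restrictions, which matter only to the vector loads in the macro version.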
--*******************************************************************
--** Majority Voter/Sync Control logic TOP LEVEL Module: voter_sync.vhd
--**
--** Description: This Module is the top level of the
--**              Majority Voter and Raceway Sync Logic
--**
--** Author : Steven Imperiali
--** Date   : 7-05-2000
--** Date   : 10-25-2000 Modified cable clock and sync
--**
--** This PLD handles the following functions:
--**   1) Raceway clock source and skew control
--**   2) Raceway sync generation
--**   3) Majority voter logic
--**   4) I2C reset logic
--**   5) Inverter for the HS LED signal
--*******************************************************************

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE STD.TEXTIO.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

ENTITY voter_sync IS
PORT (
    clk_66_pal6         : IN    std_logic;
    clk_33_pal1         : IN    std_logic;
    reset_0             : IN    std_logic;
    x_rst_brd_0         : OUT   std_logic;
    x_rst_brd_1         : OUT   std_logic;
    pll_rng_sel         : OUT   std_logic;
    pll_freq_sel        : OUT   std_logic;
    fb_sk_sel           : OUT   std_logic;
    fb_dev_by_2_0       : OUT   std_logic;
    main_sk_sel0        : OUT   std_logic;
    main_sk_sel1        : OUT   std_logic;
    jk_sk_sel0          : OUT   std_logic;
    jk_sk_sel1          : OUT   std_logic;
    jx1_clk_oe          : OUT   std_logic;
    jx2_clk_oe          : OUT   std_logic;
    sw_clk_mode2_1      : IN    std_logic_vector(2 downto 1);
    mux_clk_sel0        : OUT   std_logic;
    mux_clk_sel1        : OUT   std_logic;

    testn               : IN    std_logic;
    tms0                : IN    std_logic;
    rsync_x_nd0         : OUT   std_logic;
    rsync_x_nd1         : OUT   std_logic;
    rsync_x_nd2         : OUT   std_logic;
    rsync_x_nd3         : OUT   std_logic;
    rsync_x_pxb0        : OUT   std_logic;
    rsync_x_xbar        : OUT   std_logic;
    nd0_resetreq_0      : IN    std_logic;
    nd1_resetreq_0      : IN    std_logic;
    nd2_resetreq_0      : IN    std_logic;
    nd3_resetreq_0      : IN    std_logic;
    pq_resetreq_0       : IN    std_logic;
    resetvote_0         : OUT   std_logic;

    nd0_ckstpreqnd0_0   : IN    std_logic;
    nd0_ckstpreqnd1_0   : IN    std_logic;
    nd0_ckstpreqnd2_0   : IN    std_logic;
    nd0_ckstpreqnd3_0   : IN    std_logic;
    nd0_ckstpreqpq_0    : IN    std_logic;
    nd1_ckstpreqnd0_0   : IN    std_logic;
    nd1_ckstpreqnd1_0   : IN    std_logic;
    nd1_ckstpreqnd2_0   : IN    std_logic;
    nd1_ckstpreqnd3_0   : IN    std_logic;
    nd1_ckstpreqpq_0    : IN    std_logic;
    nd2_ckstpreqnd0_0   : IN    std_logic;
    nd2_ckstpreqnd1_0   : IN    std_logic;
    nd2_ckstpreqnd2_0   : IN    std_logic;
    nd2_ckstpreqnd3_0   : IN    std_logic;
    nd2_ckstpreqpq_0    : IN    std_logic;
    nd3_ckstpreqnd0_0   : IN    std_logic;
    nd3_ckstpreqnd1_0   : IN    std_logic;
    nd3_ckstpreqnd2_0   : IN    std_logic;
    nd3_ckstpreqnd3_0   : IN    std_logic;
    nd3_ckstpreqpq_0    : IN    std_logic;
    pq_ckstpreqnd0_0    : IN    std_logic;
    pq_ckstpreqnd1_0    : IN    std_logic;
    pq_ckstpreqnd2_0    : IN    std_logic;
    pq_ckstpreqnd3_0    : IN    std_logic;
    pq_ckstpreqpq_0     : IN    std_logic;
    pq_ckstopin_0       : OUT   std_logic;
    nd0_ckstopin_0      : OUT   std_logic;
    nd1_ckstopin_0      : OUT   std_logic;
    nd2_ckstopin_0      : OUT   std_logic;
    nd3_ckstopin_0      : OUT   std_logic;

    i2c_rst_0           : IN    std_logic;
    sda                 : INOUT std_logic;
    scl                 : INOUT std_logic;

    pxb0_hs_led         : IN    std_logic;
    hs_led              : OUT   std_logic
    );
END voter_sync;

ARCHITECTURE TOP_LEVEL_voter_sync OF voter_sync IS

--***************************************************************
--***************************************************************

COMPONENT m_voter PORT (
    clk_66_pal6     : IN  std_logic;
    reset_0         : IN  std_logic;
    request_0_0     : IN  std_logic;
    request_1_0     : IN  std_logic;
    request_2_0     : IN  std_logic;
    request_3_0     : IN  std_logic;
    request_4_0     : IN  std_logic;
    healthy0_1      : IN  std_logic;
    healthy1_1      : IN  std_logic;
    healthy2_1      : IN  std_logic;
    healthy3_1      : IN  std_logic;
    healthy4_1      : IN  std_logic;
    voteout_0       : OUT std_logic);
END COMPONENT;



--***************************************************************
--** Signals to Connect All of the Components Together
--***************************************************************

Signal healthy0_1   : std_logic;
Signal healthy1_1   : std_logic;
Signal healthy2_1   : std_logic;
Signal healthy3_1   : std_logic;
Signal healthy4_1   : std_logic;
Signal sync_d1      : std_logic;
Signal sync_d2      : std_logic;
Signal sync_d3      : std_logic;

Signal nd0_ckstop_0, nd1_ckstop_0, nd2_ckstop_0, nd3_ckstop_0,
       pq_ckstop_0  : std_logic;

Signal g_nd0_resetreq_0 : std_logic;
Signal g_nd1_resetreq_0 : std_logic;
Signal g_nd2_resetreq_0 : std_logic;
Signal g_nd3_resetreq_0 : std_logic;

BEGIN

--***************************************************************
--** Begin Architecture Here (Instantiations)
--***************************************************************

nd0_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd0_0,
    nd1_ckstpreqnd0_0,
    nd2_ckstpreqnd0_0,
    nd3_ckstpreqnd0_0,
    pq_ckstpreqnd0_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd0_ckstop_0);

nd1_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd1_0,
    nd1_ckstpreqnd1_0,
    nd2_ckstpreqnd1_0,
    nd3_ckstpreqnd1_0,
    pq_ckstpreqnd1_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd1_ckstop_0);

nd2_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd2_0,
    nd1_ckstpreqnd2_0,
    nd2_ckstpreqnd2_0,
    nd3_ckstpreqnd2_0,
    pq_ckstpreqnd2_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd2_ckstop_0);

nd3_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd3_0,
    nd1_ckstpreqnd3_0,
    nd2_ckstpreqnd3_0,
    nd3_ckstpreqnd3_0,
    pq_ckstpreqnd3_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd3_ckstop_0);

pq_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqpq_0,
    nd1_ckstpreqpq_0,
    nd2_ckstpreqpq_0,
    nd3_ckstpreqpq_0,
    pq_ckstpreqpq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    pq_ckstop_0);


-- this section was added to force a board level reset when
-- the 8240 has a watchdog failure.
-- this should have been done by feeding the 8240's WDFAIL
-- to the reset PLD instead of forcing the 8240's resetreq
-- to drive all other resetrequests.

g_nd0_resetreq_0 <= nd0_resetreq_0 AND pq_resetreq_0;
g_nd1_resetreq_0 <= nd1_resetreq_0 AND pq_resetreq_0;
g_nd2_resetreq_0 <= nd2_resetreq_0 AND pq_resetreq_0;
g_nd3_resetreq_0 <= nd3_resetreq_0 AND pq_resetreq_0;



reset_req_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    g_nd0_resetreq_0,
    g_nd1_resetreq_0,
    g_nd2_resetreq_0,
    g_nd3_resetreq_0,
    pq_resetreq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    resetvote_0);

healthy0_1 <= nd0_ckstop_0;
healthy1_1 <= nd1_ckstop_0;
healthy2_1 <= nd2_ckstop_0;
healthy3_1 <= nd3_ckstop_0;
healthy4_1 <= pq_ckstop_0;



nd0_ckstopin_0 <= nd0_ckstop_0;
nd1_ckstopin_0 <= nd1_ckstop_0;
nd2_ckstopin_0 <= nd2_ckstop_0;
nd3_ckstopin_0 <= nd3_ckstop_0;
pq_ckstopin_0  <= pq_ckstop_0;

WITH i2c_rst_0 SELECT
    sda <= clk_33_pal1 WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

WITH i2c_rst_0 SELECT
    scl <= clk_33_pal1 WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

hs_led <= NOT(pxb0_hs_led);



-- Sync Control
process (clk_66_pal6, reset_0)
BEGIN
    IF (reset_0 = '0') THEN
        sync_d1
        sync_d2
        sync_d3
        rsync_x_nd0
        rsync_x_nd1
        rsync_x_nd2
        rsync_x_nd3
        rsync_x_pxb0
        rsync_x_xbar

    ELSIF (testn
           AND reset_0 = '1') THEN
        rsync_x_nd0  <= tms0;
        rsync_x_nd1  <= tms0;
        rsync_x_nd2  <= tms0;
        rsync_x_nd3  <= tms0;
        rsync_x_pxb0 <= '0';
        rsync_x_xbar <= '0';

    ELSIF rising_edge(clk_66_pal6) THEN
        sync_d1 <= NOT(sync_d1);
        sync_d2 <= ((NOT(sync_d2) AND sync_d1) OR (sync_d2 AND NOT(sync_d1)));
        sync_d3 <= (NOT(NOT(sync_d1) AND sync_d2));

    END IF;
END process;



rsync_x_nd0  <= sync_d3;
rsync_x_nd1  <= sync_d3;
rsync_x_nd2  <= sync_d3;
rsync_x_nd3  <= sync_d3;
rsync_x_pxb0 <= sync_d3;
rsync_x_xbar <= sync_d3;

x_rst_brd_0 <= reset_0;
x_rst_brd_1 <= NOT(reset_0);



WITH sw elk mode2 1 
mux elk selO <= 



SELECT 
»0' 






WITH sw elk mode2 1 SELECT 
mux elk sell <= »0' 



WITH sw elk mode2 1 SELECT 
f b__de v_by_2_0 < = ' 0 ' 



WITH sw clk_mode2 1 SELECT 
jxl_clk_oe 



WHEN 

•0' 

WHEN 

'0» 

'1' 



WHEN 
III 

•1» WHEN 
•0' 



"DO", 66MHz local 

WHEN "01", 33MHz cable 1 

"10", -- 33MHz cable 2 

WHEN "11", --'66 MHz local 

WHEN OTHERS; 



"00", 

WHEN "01", 
"10", 

WHEN "11", 



WHEN OTHERS; 



WHEN "00", 

•2' WHEN "01", 

'Z' WHEN "10", 

»0' WHEN "11", 

' 1' WHEN OTHERS; 



jx2_clk_oe <= '1' WHEN "00",
              '1' WHEN "01",
              '1' WHEN "10",
              '1' WHEN "11",
              '1' WHEN OTHERS;


mode2 1 SELECT 








<= '1' 




WHEN 


"00", 




111 


WHEN 


"01", 






WHEN 


"10", 




'1' 


WHEN 


"11", 


'1 • 




WHEN 


OTHERS; 



WITH sw clk_mode2 1 SELECT 
pll_rng_sel <= '1' 



WITH sw elk mode2 1 SELECT 



pll_f req_sel 



'1' 
'1' 



WHEN 
'1' 
WHEN 
»1» 



"00" , 
WHEN 
"10", 
WHEN 



"01", 
"11", 



WHEN OTHERS; 



WHEN 

' 0 • 
WHEN 

'2' 



"00" , 
WHEN 
"10", 

WHEN "11* 



"01" 
-- select 0 skew for all modes

WITH sw_clk_mode2_1 SELECT
fb_sk_sel    <= 'Z' WHEN "00",
                'Z' WHEN "01",
                'Z' WHEN "10",
                'Z' WHEN "11",
                '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
main_sk_sel0 <= 'Z' WHEN "00",
                'Z' WHEN "01",
                'Z' WHEN "10",
                'Z' WHEN "11",
                '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
main_sk_sel1 <= 'Z' WHEN "00",
                'Z' WHEN "01",
                'Z' WHEN "10",
                'Z' WHEN "11",
                '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
jk_sk_sel0   <= 'Z' WHEN "00",
                'Z' WHEN "01",
                'Z' WHEN "10",
                'Z' WHEN "11",
                '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
jk_sk_sel1   <= 'Z' WHEN "00",
                'Z' WHEN "01",
                'Z' WHEN "10",
                'Z' WHEN "11",
                '1' WHEN OTHERS;

END TOP_LEVEL_voter_sync;
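
The Sync Control process in voter_sync.vhd behaves as a two-bit counter (sync_d1, sync_d2) whose decoded output sync_d3 goes low once every four cycles of the 66 MHz clock. The following C sketch simulates only the rising-edge branch of that process. It is an illustration, not part of the original design files: the type and function names are hypothetical, and the starting state is an assumption because the reset values are not legible in the listing.

```c
#include <assert.h>
#include <stdbool.h>

/* State of the three sync flip-flops; field names follow the VHDL signals. */
typedef struct {
    bool d1;
    bool d2;
    bool d3;
} sync_state;

/* One rising edge of the 66 MHz clock. Every next-state value is computed
 * from the current state, mirroring VHDL signal-assignment semantics. */
static sync_state sync_step(sync_state s)
{
    sync_state n;
    n.d1 = !s.d1;                                /* toggles every clock   */
    n.d2 = (!s.d2 && s.d1) || (s.d2 && !s.d1);   /* d2 XOR d1             */
    n.d3 = !(!s.d1 && s.d2);                     /* low when d2d1 == "10" */
    return n;
}

/* Count how many of the next `clocks` edges leave sync_d3 low. */
static int count_sync_low(sync_state s, int clocks)
{
    int low = 0;

    assert(clocks >= 0);
    for (int i = 0; i < clocks; ++i) {
        s = sync_step(s);
        if (!s.d3)
            ++low;
    }
    return low;
}
```

Starting from d1 = d2 = 0, the registered sync_d3 sequence is 1, 1, 0, 1 and then repeats, so the sync pulse is one 66 MHz period wide and recurs every fourth clock.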
/*
    MC Standard Algorithms PPC Macro language Version

    File Name:    ZDOTPR4_VMX.K

    Description:  CPP Source code for Vector Single Precision
                  Split Complex Dot Product given that input
                  vectors are relatively unaligned.

    Entry/params: ZDOTPR4_VMX (A, I, B, J, C, N)
                  ZIDOTPR4_VMX (A, I, B, J, C, N)

    Formula:      C[0] = sum (A->realp[mI]*B->realp[mJ]
                              -/+ A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                              +/- A->imagp[mI]*B->realp[mJ])
                  for m=0 to N-1

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision  Date    Engineer  Reason
    0.0       000608  fpl       Created (from zdotpr_vmx.k)
*/

#include "salppc.inc"
/**
    ESAL CPP definitions
**/

#undef FUNC_ENTRY
#undef LOAD_A
#undef LOAD_B
#undef SUFFIX

#if defined( VMX_SAL )

#define FUNC_ENTRY              _zdotpr4_vmx
#define FUNC_CONJ_ENTRY         _zidotpr4_vmx
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label

#elif defined( VMX_NN )

#define FUNC_ENTRY              _zdotpr4_vmx_nn
#define FUNC_CONJ_ENTRY         _zidotpr4_vmx_nn
#define LOAD_A( vT, rA, rB )    LVXL( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVXL( vT, rA, rB )
#define SUFFIX( label )         label##_nn

#elif defined( VMX_NC )

#define FUNC_ENTRY              _zdotpr4_vmx_nc
#define FUNC_CONJ_ENTRY         _zidotpr4_vmx_nc
#define LOAD_A( vT, rA, rB )    LVXL( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_nc

#elif defined( VMX_CN )

#define FUNC_ENTRY              _zdotpr4_vmx_cn
#define FUNC_CONJ_ENTRY         _zidotpr4_vmx_cn
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVXL( vT, rA, rB )
#define SUFFIX( label )         label##_cn

#elif defined( VMX_CC )

#define FUNC_ENTRY              _zdotpr4_vmx_cc
#define FUNC_CONJ_ENTRY         _zidotpr4_vmx_cc
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_cc

#else
#error YOU MUST DEFINE VMX_xxx, where x = C or N
#endif



#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */
/**
    Local CPP definitions
**/
#define NMASK2              0x8
#define NMASK1              0x4
#define NSHIFT              4
#define ADDRESS_INCREMENT   16

/**
    Input args
**/
#define A       r3
#define I       r4
#define B       r5
#define J       r6
#define C       r7
#define N       r8
#define EFLAG   r9

/**
    Split complex parameters
**/
#define Ar0     A
#define Ai0     r10
#define Br0     B
#define Bi0     r11
#define Cr      C
#define Ci      r12

/**
    Local registers
**/
#define count   r4
#define rtmp0   r4
#define rtmp1   r13

#define Ar1     r13
#define Ai1     r14
#define Ar2     r15
#define Ai2     r16
#define Ar3     r17
#define Ai3     r18

#define Br1     r19
#define Bi1     r20
#define Br2     r21
#define Bi2     r22
#define Br3     r23
#define Bi3     r24
#define aoffset r25
#define coffset r25
#define boffset r26
#define addr_incr r27
/**
    VMX registers
**/
#define rsumr   v0
#define rsumi   v1
#define isumr   v2
#define isumi   v3
#define rsum0   v4
#define rsum1   v5
#define isum0   v6
#define isum1   v7

#define ar0     v4
#define ai0     v5
#define ar1     v6
#define ai1     v7
#define ar2     v8
#define ai2     v9
#define ar3     v10
#define ai3     v11

#define br0     v12
#define bi0     v13
#define br1     v14
#define bi1     v15
#define br2     v16
#define bi2     v17
#define br3     v18
#define bi3     v19

#define apC     v20

#define atr0    v21
#define ati0    v22
#define atr1    v23
#define ati1    v24
#define atr2    v25
#define ati2    v26
#define atr3    v27
#define ati3    v28

/**
    FPU registers
**/
#define far         f0
#define fbr         f1
#define fai         f2
#define fbi         f3
#define frsumr      f4
#define frsumi      f5
#define fisumi      f6
#define fisumr      f7
#define frsum       f8
#define fisum       f9
#define rsum_vmx    f10
#define isum_vmx    f11



/**
    Begin code text, Save some registers
    Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
    MR( rtmp0, Cr )
    MR( Cr, Ci )
    MR( Ci, rtmp0 )
    MR( rtmp0, Br0 )
    MR( Br0, Bi0 )
    MR( Bi0, rtmp0 )
/**
    Here for normal inner product
**/
FUNC_PROLOG
U_ENTRY( FUNC_ENTRY )
DECLARE_f0_f11
DECLARE_r3_r27
DECLARE_v0_v28
/**
    Initial setup code
**/
    SAVE_r13_r27
    USE_THRU_v28( VREGSAVE_COND )
    LFS( frsumr, Ar0, 0 )
    FSUBS( frsumr, frsumr, frsumr )
    FMR( frsumi, frsumr )
    FMR( fisumr, frsumr )
    FMR( fisumi, frsumr )
    FMR( rsum_vmx, frsumr )
    FMR( isum_vmx, frsumr )

/**
    Process unaligned vector section first
**/
LABEL( SUFFIX( cont ) )
    GET_VMX_UNALIGNED_COUNT( count, Br0 )
    LI( aoffset, 0 )
    LI( boffset, 0 )
    BEQ( SUFFIX( aligned ) )
    SUB( N, count )                     /* adjust N for after loop */
/**
    Here to do first 1 to 3 points using standard FP
    Store result for later post_loop processing
**/
    LFSX( far, Ar0, aoffset )
    LFSX( fai, Ai0, aoffset )
    DECR_C( count )
    LFSX( fbr, Br0, boffset )
    LFSX( fbi, Bi0, boffset )
    FMULS( frsumr, far, fbr )
    FMULS( frsumi, fai, fbi )
    FMULS( fisumi, far, fbi )
    FMULS( fisumr, fai, fbr )
    ADDI( Ar0, Ar0, 4 )
    ADDI( Ai0, Ai0, 4 )
    ADDI( Br0, Br0, 4 )
    ADDI( Bi0, Bi0, 4 )
    BEQ( SUFFIX( aligned ) )
/**
    Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
    LFSX( far, Ar0, aoffset )
    LFSX( fai, Ai0, aoffset )
    DECR_C( count )
    LFSX( fbr, Br0, boffset )
    LFSX( fbi, Bi0, boffset )
    FMADDS( frsumr, far, fbr, frsumr )
    ADDI( Ar0, Ar0, 4 )
    FMADDS( frsumi, fai, fbi, frsumi )
    ADDI( Ai0, Ai0, 4 )
    FMADDS( fisumi, far, fbi, fisumi )
    ADDI( Br0, Br0, 4 )
    FMADDS( fisumr, fai, fbr, fisumr )
    ADDI( Bi0, Bi0, 4 )
    BNE( SUFFIX( pre_loop ) )
/**
    Here for VMX aligned loop code
    Prepare for loop entry: assign loop pointers, counters
**/
LABEL( SUFFIX( aligned ) )
    SRWI_C( count, N, 4 )               /* 16 per trip */
    LVSL( apC, Ar0, aoffset )
    LI( aoffset, 0 )
    LI( boffset, 0 )

    ADDI( Ar1, Ar0, 16 )
    VXOR( rsumr, rsumr, rsumr )
    ADDI( Ar2, Ar0, 32 )
    ADDI( Ar3, Ar0, 48 )

    ADDI( Ai1, Ai0, 16 )
    VXOR( isumi, isumi, isumi )
    ADDI( Ai2, Ai0, 32 )
    ADDI( Ai3, Ai0, 48 )

    ADDI( Br1, Br0, 16 )
    VXOR( rsumi, rsumi, rsumi )
    ADDI( Br2, Br0, 32 )
    ADDI( Br3, Br0, 48 )

    ADDI( Bi1, Bi0, 16 )
    ADDI( Bi2, Bi0, 32 )
    VXOR( isumr, isumr, isumr )
    ADDI( Bi3, Bi0, 48 )
    BEQ( SUFFIX( two_left ) )

/**
    Loop windin section
**/
    LOAD_A( atr0, Ar0, aoffset )
    LOAD_A( ati0, Ai0, aoffset )
    LOAD_A( atr1, Ar1, aoffset )
    LOAD_A( ati1, Ai1, aoffset )
    LOAD_A( atr2, Ar2, aoffset )
    LOAD_A( ati2, Ai2, aoffset )
    VPERM( ar0, atr0, atr1, apC )
    LOAD_B( br0, Br0, boffset )
    LOAD_B( bi0, Bi0, boffset )
    DECR_C( count )
    VPERM( ai0, ati0, ati1, apC )
    LOAD_B( br1, Br1, boffset )
    VPERM( ar1, atr1, atr2, apC )
    LOAD_A( atr3, Ar3, aoffset )
    BR( SUFFIX( mid_loop ) )

/**
    Top of vector loop
**/
LABEL( SUFFIX( loop ) )
/* { */
    LOAD_A( atr2, Ar2, aoffset )
    VMADDFP( rsumr, ar3, br3, rsumr )
    LOAD_A( ati2, Ai2, aoffset )
    VPERM( ar0, atr0, atr1, apC )       /* uses last pass value */
    VMADDFP( rsumi, ai3, bi3, rsumi )
    LOAD_B( br0, Br0, boffset )
    LOAD_B( bi0, Bi0, boffset )
    DECR_C( count )
    VPERM( ai0, ati0, ati1, apC )
    LOAD_B( br1, Br1, boffset )
    VPERM( ar1, atr1, atr2, apC )
    VMADDFP( isumi, ar3, bi3, isumi )
    LOAD_A( atr3, Ar3, aoffset )
    VMADDFP( isumr, ai3, br3, isumr )

/**
    Loop entry
**/
LABEL( SUFFIX( mid_loop ) )
    VMADDFP( rsumr, ar0, br0, rsumr )
    VPERM( ai1, ati1, ati2, apC )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    LOAD_A( ati3, Ai3, aoffset )
    VMADDFP( isumr, ai0, br0, isumr )
    LOAD_B( bi1, Bi1, boffset )
    ADDI( aoffset, aoffset, 64 )
    VPERM( ar2, atr2, atr3, apC )
    VMADDFP( isumi, ar0, bi0, isumi )
    LOAD_B( br2, Br2, boffset )
    VMADDFP( rsumr, ar1, br1, rsumr )
    LOAD_B( bi2, Bi2, boffset )
    VMADDFP( isumr, ai1, br1, isumr )
/**
    Loop exit
**/
    VPERM( ai2, ati2, ati3, apC )
    BEQ( SUFFIX( loop_exit ) )

    LOAD_A( atr0, Ar0, aoffset )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    LOAD_A( ati0, Ai0, aoffset )
    VMADDFP( isumi, ar1, bi1, isumi )
    LOAD_B( br3, Br3, boffset )
    VPERM( ar3, atr3, atr0, apC )
    VMADDFP( rsumr, ar2, br2, rsumr )
    LOAD_A( atr1, Ar1, aoffset )
    VMADDFP( rsumi, ai2, bi2, rsumi )
    VPERM( ai3, ati3, ati0, apC )
    VMADDFP( isumi, ar2, bi2, isumi )
    LOAD_B( bi3, Bi3, boffset )
    ADDI( boffset, boffset, 64 )
    LOAD_A( ati1, Ai1, aoffset )
    VMADDFP( isumr, ai2, br2, isumr )
/* } */
    BR( SUFFIX( loop ) )

/**
    Loop windout section
**/
LABEL( SUFFIX( loop_exit ) )
    LOAD_A( atr0, Ar0, aoffset )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    LOAD_A( ati0, Ai0, aoffset )
    VMADDFP( isumi, ar1, bi1, isumi )
    LOAD_B( br3, Br3, boffset )
    VPERM( ar3, atr3, atr0, apC )
    VMADDFP( rsumr, ar2, br2, rsumr )
    VMADDFP( rsumi, ai2, bi2, rsumi )
    VPERM( ai3, ati3, ati0, apC )
    VMADDFP( isumi, ar2, bi2, isumi )
    LOAD_B( bi3, Bi3, boffset )
    ADDI( boffset, boffset, 64 )
    VMADDFP( isumr, ai2, br2, isumr )
    VMADDFP( rsumr, ar3, br3, rsumr )
    VMADDFP( rsumi, ai3, bi3, rsumi )
    VMADDFP( isumi, ar3, bi3, isumi )
    VMADDFP( isumr, ai3, br3, isumr )

/**
    Remaining sum updates
**/
LABEL( SUFFIX( two_left ) )
    ANDI_C( count, N, 0x8 )             /* bit 3 */
    BEQ( SUFFIX( one_left ) )

    LOAD_B( br0, Br0, boffset )
    LOAD_B( bi0, Bi0, boffset )
    LOAD_B( br1, Br1, boffset )
    LOAD_B( bi1, Bi1, boffset )
    ADDI( boffset, boffset, 32 )

    LOAD_A( atr0, Ar0, aoffset )
    LOAD_A( ati0, Ai0, aoffset )
    LOAD_A( atr1, Ar1, aoffset )
    LOAD_A( ati1, Ai1, aoffset )
    LOAD_A( atr2, Ar2, aoffset )
    LOAD_A( ati2, Ai2, aoffset )
    ADDI( aoffset, aoffset, 32 )

    VPERM( ar0, atr0, atr1, apC )       /* uses last pass value */
    VPERM( ai0, ati0, ati1, apC )
    VPERM( ar1, atr1, atr2, apC )
    VPERM( ai1, ati1, ati2, apC )



    VMADDFP( rsumr, ar0, br0, rsumr )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    VMADDFP( isumr, ai0, br0, isumr )
    VMADDFP( isumi, ar0, bi0, isumi )
    VMADDFP( rsumr, ar1, br1, rsumr )
    VMADDFP( isumr, ai1, br1, isumr )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    VMADDFP( isumi, ar1, bi1, isumi )

    VMR( atr3, atr1 )
    VMR( ati3, ati1 )

LABEL( SUFFIX( one_left ) )
    ANDI_C( count, N, 0x4 )             /* bit 2 */
    BEQ( SUFFIX( combine ) )

    LOAD_B( br0, Br0, boffset )
    LOAD_B( bi0, Bi0, boffset )
    ADDI( boffset, boffset, 16 )

    LOAD_A( atr0, Ar0, aoffset )
    LOAD_A( ati0, Ai0, aoffset )
    LOAD_A( atr1, Ar1, aoffset )
    LOAD_A( ati1, Ai1, aoffset )
    ADDI( aoffset, aoffset, 16 )

    VPERM( ar0, atr0, atr1, apC )       /* uses last pass value */
    VPERM( ai0, ati0, ati1, apC )

    VMADDFP( rsumr, ar0, br0, rsumr )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    VMADDFP( isumr, ai0, br0, isumr )
    VMADDFP( isumi, ar0, bi0, isumi )

/**
    combine partial sums, permute, write out results
**/
LABEL( SUFFIX( combine ) )
    VSUBFP( rsumr, rsumr, rsumi )       /* rsumr = rsumr - rsumi */
    VADDFP( isumi, isumi, isumr )
/**
    8 bytes/cycle shuffle:
    real/imag logic should be intermixed for efficiency
**/
    VMRGHW( rsum0, rsumr, rsumr )
    ANDI_C( addr_incr, N, 0x3 )
    VMRGHW( isum0, isumi, isumi )
    VMRGLW( rsum1, rsumr, rsumr )
    SUB( addr_incr, N, addr_incr )      /* offset index for remainders */
    VMRGLW( isum1, isumi, isumi )
    VADDFP( rsum0, rsum1, rsum0 )
    SLWI( addr_incr, addr_incr, 2 )     /* byte offset */
    VADDFP( isum0, isum1, isum0 )
    VMRGHW( rsum1, rsum0, rsum0 )
    ADD( Ar0, Ar0, addr_incr )
    VMRGHW( isum1, isum0, isum0 )
    ADD( Ai0, Ai0, addr_incr )
    VMRGLW( rsum0, rsum0, rsum0 )
    ADD( Br0, Br0, addr_incr )
    VMRGLW( isum0, isum0, isum0 )
    ADD( Bi0, Bi0, addr_incr )
    VADDFP( rsumr, rsum1, rsum0 )
    LI( coffset, 0 )                    /* needed for output */
    VADDFP( isumi, isum1, isum0 )
/**
    4 byte stores
**/
    STVEWX( rsumr, Cr, coffset )
    STVEWX( isumi, Ci, coffset )


/**
    Remainders of 1-3 more to do
**/
    ANDI_C( N, 3 )
    LFS( rsum_vmx, Cr, 0 )
    LFS( isum_vmx, Ci, 0 )
    BEQ( SUFFIX( scaler_vmx_combine ) )
/**
    Here to do last 1-3 points using standard FP
**/
LABEL( SUFFIX( post_loop ) )
    LFS( far, Ar0, 0 )
    LFS( fai, Ai0, 0 )
    DECR_C( N )
    LFS( fbr, Br0, 0 )
    LFS( fbi, Bi0, 0 )
    FMADDS( frsumr, far, fbr, frsumr )
    FMADDS( frsumi, fai, fbi, frsumi )
    FMADDS( fisumi, far, fbi, fisumi )
    FMADDS( fisumr, fai, fbr, fisumr )
    ADDI( Ar0, Ar0, 4 )
    ADDI( Br0, Br0, 4 )
    ADDI( Ai0, Ai0, 4 )
    ADDI( Bi0, Bi0, 4 )
    BNE( SUFFIX( post_loop ) )

/**
    Write out result
**/
LABEL( SUFFIX( scaler_vmx_combine ) )
    FSUBS( frsum, frsumr, frsumi )      /* rsumr = rsumr - rsumi */
    FADDS( fisum, fisumi, fisumr )
    FADDS( frsum, frsum, rsum_vmx )
    FADDS( fisum, fisum, isum_vmx )
    STFS( frsum, Cr, 0 )
    STFS( fisum, Ci, 0 )
/**
    return
**/
LABEL( SUFFIX( ret ) )
    FREE_THRU_v28( VREGSAVE_COND )
    REST_r13_r27
    RETURN
FUNC_EPILOG
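
The formula in the file header can be restated as a scalar reference. This is a minimal C sketch for illustration only. The struct name split_complex, the function name zdotpr_ref and the conjugate flag are hypothetical and do not appear in the listing. The sign choice follows the "-/+" and "+/-" notation of the header, with the first sign taken for the normal entry and the second for the conjugate entry.

```c
#include <assert.h>
#include <stddef.h>

/* Split-complex vector: separate arrays of real and imaginary parts
 * (field names follow the header's A->realp / A->imagp notation). */
typedef struct {
    const float *realp;
    const float *imagp;
} split_complex;

/* Scalar reference for the split complex dot product.
 * i_stride and j_stride are the element strides I and J.
 * conjugate == 0: C = sum A[m] * B[m]
 * conjugate != 0: C = sum conj(A[m]) * B[m]
 * Results: c[0] is the real part, c[1] the imaginary part. */
static void zdotpr_ref(const split_complex *a, size_t i_stride,
                       const split_complex *b, size_t j_stride,
                       float c[2], size_t n, int conjugate)
{
    float re = 0.0f;
    float im = 0.0f;

    assert(a != NULL && b != NULL && c != NULL);
    for (size_t m = 0; m < n; ++m) {
        float ar = a->realp[m * i_stride];
        float ai = a->imagp[m * i_stride];
        float br = b->realp[m * j_stride];
        float bi = b->imagp[m * j_stride];

        if (conjugate) {
            re += ar * br + ai * bi;
            im += ar * bi - ai * br;
        } else {
            re += ar * br - ai * bi;
            im += ar * bi + ai * br;
        }
    }
    c[0] = re;
    c[1] = im;
}
```

The macro version keeps four separate accumulators (rsumr, rsumi, isumr, isumi) and applies the subtraction and addition once at the end, so its rounding can differ slightly from this reference.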
/*
    MC Standard Algorithms PPC Macro language Version

    File Name:    ZDOTPR4_VMX.MAC

    Description:  Vector Single Precision Complex Dot Product
                  CPP dummy file for unaligned vector processing

    Entry/params: ZDOTPR4_VMX (A, I, B, J, C, N)
    Formula:      C[0] = sum (A->realp[mI]*B->realp[mJ]
                              - A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                              + A->imagp[mI]*B->realp[mJ])
                  for m=0 to N-1

    Mercury Computer Systems, Inc.
    Copyright (c) 1998 All rights reserved

    Revision  Date    Engineer  Reason
    0.0       000607  fpl       Created (from zdotpr_vmx.mac)
*/



#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE )     /* 1 variant: _zdotpr4_vmx() */

#define VMX_SAL
#include "zdotpr4_vmx.k"

#else                                       /* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr4_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr4_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr4_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr4_vmx.k"
#undef VMX_CC

#endif                                      /* end COMPILE_ESAL_JUMP_TABLE */
#endif                                      /* end BUILD_MAX */
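
The .MAC file builds several variants of one routine by including the same .K body repeatedly under different flags, with SUFFIX( label ) pasting a distinct suffix onto each label. The same idea can be sketched in a single C file with a variant-stamping macro. This is an analogy, not the original build mechanism: DEFINE_DOT_VARIANT, LOAD_CACHED and LOAD_STREAMED are hypothetical names, and the two load macros are plain array reads standing in for the LVX and LVXL instructions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two vector load flavors (LVX / LVXL).
 * In plain C both are ordinary reads; only the selection mechanism is
 * being illustrated. */
#define LOAD_CACHED(p, i)   ((p)[(i)])
#define LOAD_STREAMED(p, i) ((p)[(i)])

/* Stamp out one variant of the routine. `suffix` plays the role of
 * SUFFIX( label ): token pasting gives every variant a unique name. */
#define DEFINE_DOT_VARIANT(suffix, LOAD_A, LOAD_B)                  \
    static int dot_##suffix(const int *a, const int *b, int n)     \
    {                                                               \
        int sum = 0;                                                \
        assert(n >= 0);                                             \
        for (int i = 0; i < n; ++i)                                 \
            sum += LOAD_A(a, i) * LOAD_B(b, i);                     \
        return sum;                                                 \
    }

/* Four variants, one per combination of load flavor for A and B. */
DEFINE_DOT_VARIANT(nn, LOAD_STREAMED, LOAD_STREAMED)
DEFINE_DOT_VARIANT(nc, LOAD_STREAMED, LOAD_CACHED)
DEFINE_DOT_VARIANT(cn, LOAD_CACHED, LOAD_STREAMED)
DEFINE_DOT_VARIANT(cc, LOAD_CACHED, LOAD_CACHED)
```

Repeated inclusion, as in the listing, achieves the same result when the body is too large to fit comfortably in a single macro definition.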
/*
    MC Standard Algorithms PPC Macro language Version

    File Name:    ZDOTPR.K

    Description:  CPP Source code for Vector Single Precision
                  Split Complex Dot Product

    Entry/params: ZDOTPR (A, I, B, J, C, N)
                  ZIDOTPR (A, I, B, J, C, N)

    Formula:      C[0] = sum (A->realp[mI]*B->realp[mJ]
                              -/+ A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                              +/- A->imagp[mI]*B->realp[mJ])
                  for m=0 to N-1

    Mercury Computer Systems, Inc.
    Copyright (c) 1998 All rights reserved

    Revision  Date    Engineer  Reason
              981215  fpl       Created
              990310  fpl       Integrated with 750 library
              000131  jflc      salppc.inc changes
              000223  fpl       Fixed pre-loop bug
              000717  fpl       Added dsts, removed LVXLs
*/

#include "salppc.inc"
/**
    ESAL CPP definitions
**/

#undef FUNC_CONJ_ENTRY
#undef FUNC_ENTRY
#undef LOAD_A
#undef LOAD_B
#undef SUFFIX

#if defined( VMX_SAL )

#define FUNC_ENTRY              _zdotpr_vmx
#define FUNC_CONJ_ENTRY         _zidotpr_vmx
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NN )

#define FUNC_ENTRY              _zdotpr_vmx_nn
#define FUNC_CONJ_ENTRY         _zidotpr_vmx_nn
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_nn
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NC )

#define FUNC_ENTRY              _zdotpr_vmx_nc
#define FUNC_CONJ_ENTRY         _zidotpr_vmx_nc
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_nc
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )    DST( ptr, control, 0 ) \
                                ADDI( ptr, ptr, 64 )
#define DSTB( ptr, control )
#define DST_ENABLE

#elif defined( VMX_CN )

#define FUNC_ENTRY              _zdotpr_vmx_cn
#define FUNC_CONJ_ENTRY         _zidotpr_vmx_cn
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_cn
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )    DST( ptr, control, 0 ) \
                                ADDI( ptr, ptr, 64 )
#define DST_ENABLE

#elif defined( VMX_CC )

//#define FUNC_ENTRY            _zdotpr_vmx_cc
#define FUNC_ENTRY              _zdotpr_vmx
#define FUNC_CONJ_ENTRY         _zidotpr_vmx_cc
#define LOAD_A( vT, rA, rB )    LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB )    LVX( vT, rA, rB )
#define SUFFIX( label )         label##_cc
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE


#else 
#error 

#endif 



YOU MUST DEFINE VMX_xxx, where x = C or N 



#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */ 
/**
  Local CPP definitions
**/
#define NMASK2 0x8
#define NMASK1 0x4
#define NSHIFT 4
#define ADDRESS_INCREMENT 16

/**
  Input args
**/
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
  Split complex parameters
**/
#define Ar0 A
#define Ai0 r10
#define Br0 B
#define Bi0 r11
#define Cr C
#define Ci r12

/**
  Local registers
**/
#define count r4
#define rtmp0 r4
#define rtmp1 r13

#define dst_stride r13
#define num_blocks r14
#define Ar1 r13
#define Ai1 r14
#define Ar2 r15
#define Ai2 r16
#define Ar3 r17
#define Ai3 r18
#define Br1 r19
#define Bi1 r20
#define Br2 r21
#define Bi2 r22
#define Br3 r23
#define Bi3 r24
#define ptr_offset0 r25
#define ptr_offset1 r26
#define addr_incr r27
#define dst_rptr r28
#define dst_iptr r29
#define dst_control r30

/**
  VMX registers
**/
#define rsumr v0
#define rsumi v1
#define isumr v2
#define isumi v3
#define rsum0 v4
#define rsum1 v5
#define isum0 v6
#define isum1 v7

#define ar0 v4
#define ai0 v5
#define ar1 v6
#define ai1 v7
#define ar2 v8
#define ai2 v9
#define ar3 v10
#define ai3 v11

#define br0 v12
#define bi0 v13
#define br1 v14
#define bi1 v15
#define br2 v16
#define bi2 v17
#define br3 v18
#define bi3 v19

/**
  FPU registers
**/
#define far f0
#define fbr f1
#define fai f2
#define fbi f3
#define frsumr f4
#define frsumi f5
#define fisumi f6
#define fisumr f7
#define frsum f8
#define fisum f9
#define rsum_vmx f10
#define isum_vmx f11



/**
  Begin code text. Save some registers
  Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
    MR( rtmp0, Cr )
    MR( Cr, Ci )
    MR( Ci, rtmp0 )
    MR( rtmp0, Br0 )
    MR( Br0, Bi0 )
    MR( Bi0, rtmp0 )
/**
  Here for normal inner product
**/
U_ENTRY( FUNC_ENTRY )
    DECLARE_f0_f11
    DECLARE_r3_r30
    DECLARE_v0_v19
/**
  Initial setup code
**/
    SAVE_r13_r30
    USE_THRU_v19( VREGSAVE_COND )
    LFS( frsumr, Ar0, 0 )
    FSUBS( frsumr, frsumr, frsumr )
    FMR( frsumi, frsumr )
    FMR( fisumr, frsumr )
    FMR( fisumi, frsumr )
    FMR( rsum_vmx, frsumr )
    FMR( isum_vmx, frsumr )
/**
  Process unaligned vector section first
**/

LABEL( SUFFIX( cont ) )
    GET_VMX_UNALIGNED_COUNT( count, Ar0 )
    LI( ptr_offset0, 0 )
    BEQ( SUFFIX( aligned ) )
    SUB( N, N, count ) /* adjust N for after loop */
/**
  Here to do first 1 to 3 points using standard FP
  Store result for later post_loop processing
**/
    LFSX( far, Ar0, ptr_offset0 )
    LFSX( fai, Ai0, ptr_offset0 )
    DECR_C( count )
    LFSX( fbr, Br0, ptr_offset0 )
    LFSX( fbi, Bi0, ptr_offset0 )
    FMULS( frsumr, far, fbr )
    FMULS( frsumi, fai, fbi )
    FMULS( fisumi, far, fbi )
    FMULS( fisumr, fai, fbr )
    ADDI( Ar0, Ar0, 4 )
    ADDI( Ai0, Ai0, 4 )
    ADDI( Br0, Br0, 4 )
    ADDI( Bi0, Bi0, 4 )
    BEQ( SUFFIX( aligned ) )
/**
  Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
    LFSX( far, Ar0, ptr_offset0 )
    LFSX( fai, Ai0, ptr_offset0 )
    DECR_C( count )
    LFSX( fbr, Br0, ptr_offset0 )
    LFSX( fbi, Bi0, ptr_offset0 )
    FMADDS( frsumr, far, fbr, frsumr )
    ADDI( Ar0, Ar0, 4 )
    FMADDS( frsumi, fai, fbi, frsumi )
    ADDI( Ai0, Ai0, 4 )
    FMADDS( fisumi, far, fbi, fisumi )
    ADDI( Br0, Br0, 4 )
    FMADDS( fisumr, fai, fbr, fisumr )
    ADDI( Bi0, Bi0, 4 )
    BNE( SUFFIX( pre_loop ) )
/**
  Here for VMX aligned loop code
  Prepare for loop entry: assign loop pointers, counters
**/
LABEL( SUFFIX( aligned ) )

'iLl^^lk!cS('contrri!faS^ ^ytes^er.bloclc, block.count, 

byte__stride ) 
#if defined( DST_ENABLE )
#if defined( EXPAND_NCC )
    MR( dst_rptr, Ar )
    MR( dst_iptr, Ai )
#elif defined( EXPAND_CNC )
    MR( dst_rptr, Br )
    MR( dst_iptr, Bi )
#endif
    MAKE_STREAM_CODE( dst_control, 64, 1, 0 )
    DSTA( dst_rptr, dst_control )
    DSTA( dst_iptr, dst_control )
    DSTB( dst_rptr, dst_control )
    DSTB( dst_iptr, dst_control )
#endif



16 per trip */ 



"?(addr !nc?!'.^DRElf INCREMENT) /* con^^^nrs-def ined above */ 
^SSr!o?fslt"'plf ofSeti)'^ /* will be adding addr_incr « 3 */ 

    ADD( Ar1, Ar0, addr_incr )
    VXOR( rsumr, rsumr, rsumr )
    ADD( Br1, Br0, addr_incr )
    ADD( Ai1, Ai0, addr_incr )
    VXOR( rsumi, rsumi, rsumi )
    ADD( Bi1, Bi0, addr_incr )

    ADD( Ar2, Ar1, addr_incr )
    VXOR( isumr, isumr, isumr )
    ADD( Br2, Br1, addr_incr )
    ADD( Ai2, Ai1, addr_incr )
    VXOR( isumi, isumi, isumi )
    ADD( Bi2, Bi1, addr_incr )

    ADD( Ar3, Ar2, addr_incr )
    ADD( Br3, Br2, addr_incr )
    ADD( Ai3, Ai2, addr_incr )
    ADD( Bi3, Bi2, addr_incr )

    SLWI( addr_incr, addr_incr, 3 ) /* bump by 8 elements */

/**
  Loop entry code
**/
    DSTA( dst_rptr, dst_control )
    LOAD_A( ar0, Ar0, ptr_offset0 )
    DSTB( dst_rptr, dst_control )
    LOAD_B( br0, Br0, ptr_offset0 )
    LOAD_A( ai0, Ai0, ptr_offset0 )
    LOAD_B( bi0, Bi0, ptr_offset0 )

/**
  Top of double loop structure
**/
LABEL( SUFFIX( loop0 ) )
    LOAD_A( ar1, Ar1, ptr_offset0 )
    VMADDFP( rsumr, ar0, br0, rsumr )
    DSTA( dst_iptr, dst_control )
    LOAD_B( br1, Br1, ptr_offset0 )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    LOAD_A( ai1, Ai1, ptr_offset0 )
    LOAD_B( bi1, Bi1, ptr_offset0 )
    DSTB( dst_iptr, dst_control )
    DECR_C( count )
    LOAD_A( ar2, Ar2, ptr_offset0 )
    VMADDFP( isumi, ar0, bi0, isumi )
    VMADDFP( isumr, ai0, br0, isumr )
    LOAD_B( br2, Br2, ptr_offset0 )
    VMADDFP( rsumr, ar1, br1, rsumr )
    ADD( ptr_offset1, ptr_offset1, addr_incr )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    LOAD_A( ai2, Ai2, ptr_offset0 )
    VMADDFP( isumi, ar1, bi1, isumi )
    LOAD_B( bi2, Bi2, ptr_offset0 )
    VMADDFP( isumr, ai1, br1, isumr )
    VMADDFP( rsumr, ar2, br2, rsumr )
    LOAD_A( ar3, Ar3, ptr_offset0 )
    VMADDFP( rsumi, ai2, bi2, rsumi )
    LOAD_B( br3, Br3, ptr_offset0 )
    LOAD_A( ai3, Ai3, ptr_offset0 )
    VMADDFP( isumi, ar2, bi2, isumi )
    LOAD_B( bi3, Bi3, ptr_offset0 )
    VMADDFP( isumr, ai2, br2, isumr )
    BEQ( SUFFIX( loop0_exit ) )
    DSTA( dst_rptr, dst_control )
    LOAD_A( ar0, Ar0, ptr_offset1 )
    VMADDFP( rsumr, ar3, br3, rsumr )
    VMADDFP( rsumi, ai3, bi3, rsumi )
    DSTB( dst_rptr, dst_control )
    LOAD_B( br0, Br0, ptr_offset1 )
    VMADDFP( isumi, ar3, bi3, isumi )
    LOAD_A( ai0, Ai0, ptr_offset1 )
    LOAD_B( bi0, Bi0, ptr_offset1 )
    VMADDFP( isumr, ai3, br3, isumr )
    BR( SUFFIX( loop1 ) )
/**
  loop exit
**/
LABEL( SUFFIX( loop0_exit ) )
    MR( ptr_offset0, ptr_offset1 )
    BR( SUFFIX( loop1_exit ) )

/**
  Top of second loop
**/
LABEL( SUFFIX( loop1 ) )
    LOAD_A( ar1, Ar1, ptr_offset1 )
    VMADDFP( rsumr, ar0, br0, rsumr )
    DSTA( dst_iptr, dst_control )
    LOAD_B( br1, Br1, ptr_offset1 )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    LOAD_A( ai1, Ai1, ptr_offset1 )
    LOAD_B( bi1, Bi1, ptr_offset1 )
    DSTB( dst_iptr, dst_control )
    DECR_C( count )
    LOAD_A( ar2, Ar2, ptr_offset1 )
    VMADDFP( isumi, ar0, bi0, isumi )
    VMADDFP( isumr, ai0, br0, isumr )
    LOAD_B( br2, Br2, ptr_offset1 )
    VMADDFP( rsumr, ar1, br1, rsumr )
    ADD( ptr_offset0, ptr_offset0, addr_incr )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    LOAD_A( ai2, Ai2, ptr_offset1 )
    VMADDFP( isumi, ar1, bi1, isumi )
    LOAD_B( bi2, Bi2, ptr_offset1 )
    VMADDFP( isumr, ai1, br1, isumr )
    VMADDFP( rsumr, ar2, br2, rsumr )
    LOAD_A( ar3, Ar3, ptr_offset1 )
    VMADDFP( rsumi, ai2, bi2, rsumi )
    LOAD_B( br3, Br3, ptr_offset1 )
    LOAD_A( ai3, Ai3, ptr_offset1 )
    VMADDFP( isumi, ar2, bi2, isumi )
    LOAD_B( bi3, Bi3, ptr_offset1 )
    VMADDFP( isumr, ai2, br2, isumr )
    BEQ( SUFFIX( loop1_exit ) )
    DSTA( dst_rptr, dst_control )
    LOAD_A( ar0, Ar0, ptr_offset0 )
    VMADDFP( rsumr, ar3, br3, rsumr )
    VMADDFP( rsumi, ai3, bi3, rsumi )
    DSTB( dst_rptr, dst_control )
    LOAD_B( br0, Br0, ptr_offset0 )
    VMADDFP( isumi, ar3, bi3, isumi )
    LOAD_A( ai0, Ai0, ptr_offset0 )
    LOAD_B( bi0, Bi0, ptr_offset0 )
    VMADDFP( isumr, ai3, br3, isumr )
    BR( SUFFIX( loop0 ) )

/**
  Drop out of loop, flush pipe
**/
LABEL( SUFFIX( loop1_exit ) )
    VMADDFP( rsumr, ar3, br3, rsumr )
    VMADDFP( rsumi, ai3, bi3, rsumi )
    VMADDFP( isumi, ar3, bi3, isumi )
    VMADDFP( isumr, ai3, br3, isumr )
/**
  Remaining sum updates
**/
LABEL( SUFFIX( two_left ) )
    ANDI_C( count, N, 0x8 ) /* bit 3 */
    BEQ( SUFFIX( one_left ) )
    LOAD_A( ar0, Ar0, ptr_offset0 )
    LOAD_B( br0, Br0, ptr_offset0 )
    LOAD_A( ai0, Ai0, ptr_offset0 )
    LOAD_B( bi0, Bi0, ptr_offset0 )
    LOAD_A( ar1, Ar1, ptr_offset0 )
    LOAD_B( br1, Br1, ptr_offset0 )
    LOAD_A( ai1, Ai1, ptr_offset0 )
    LOAD_B( bi1, Bi1, ptr_offset0 )
    VMADDFP( rsumr, ar0, br0, rsumr )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    VMADDFP( isumi, ar0, bi0, isumi )
    VMADDFP( isumr, ai0, br0, isumr )
    VMADDFP( rsumr, ar1, br1, rsumr )
    VMADDFP( rsumi, ai1, bi1, rsumi )
    VMADDFP( isumi, ar1, bi1, isumi )
    VMADDFP( isumr, ai1, br1, isumr )
    ADDI( ptr_offset0, ptr_offset0, 32 )

LABEL( SUFFIX( one_left ) )
    ANDI_C( count, N, 0x4 ) /* bit 2 */
    BEQ( SUFFIX( combine ) )
    LOAD_A( ar0, Ar0, ptr_offset0 )
    LOAD_B( br0, Br0, ptr_offset0 )
    LOAD_A( ai0, Ai0, ptr_offset0 )
    LOAD_B( bi0, Bi0, ptr_offset0 )
    VMADDFP( rsumr, ar0, br0, rsumr )
    VMADDFP( rsumi, ai0, bi0, rsumi )
    VMADDFP( isumi, ar0, bi0, isumi )
    VMADDFP( isumr, ai0, br0, isumr )
    ADDI( ptr_offset0, ptr_offset0, 16 )

/**
  combine partial sums, permute, write out results
**/
LABEL( SUFFIX( combine ) )
    VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
    VADDFP( isumi, isumi, isumr )
/**
  8 bytes/cycle shuffle:
  real/imag logic should be intermixed for efficiency
**/
    VMRGHW( rsum0, rsumr, rsumr )
    ANDI_C( addr_incr, N, 0x3 )
    VMRGHW( isum0, isumi, isumi )
    VMRGLW( rsum1, rsumr, rsumr )
    SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
    VMRGLW( isum1, isumi, isumi )
    VADDFP( rsum0, rsum1, rsum0 )
    SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
    VADDFP( isum0, isum1, isum0 )

    VMRGHW( rsum1, rsum0, rsum0 )
    ADD( Ar0, Ar0, addr_incr )
    VMRGHW( isum1, isum0, isum0 )
    ADD( Ai0, Ai0, addr_incr )
    VMRGLW( rsum0, rsum0, rsum0 )
    ADD( Br0, Br0, addr_incr )
    VMRGLW( isum0, isum0, isum0 )
    ADD( Bi0, Bi0, addr_incr )
    VADDFP( rsumr, rsum1, rsum0 )
    LI( ptr_offset0, 0 ) /* needed for output */
    VADDFP( isumi, isum1, isum0 )
/**
  4 byte stores
**/
    STVEWX( rsumr, Cr, ptr_offset0 )
    STVEWX( isumi, Ci, ptr_offset0 )
/**
  Remainders of 1-3 more to do
**/
    ANDI_C( N, N, 3 )
    LFS( rsum_vmx, Cr, 0 )
    LFS( isum_vmx, Ci, 0 )
    BEQ( SUFFIX( scaler_vmx_combine ) )
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C3 




/**
  Here to do last 1-3 points using standard FP
**/
LABEL( SUFFIX( post_loop ) )
    LFS( far, Ar0, 0 )
    LFS( fai, Ai0, 0 )
    DECR_C( N )
    LFS( fbr, Br0, 0 )
    LFS( fbi, Bi0, 0 )
    FMADDS( frsumr, far, fbr, frsumr )
    FMADDS( frsumi, fai, fbi, frsumi )
    FMADDS( fisumi, far, fbi, fisumi )
    FMADDS( fisumr, fai, fbr, fisumr )
    ADDI( Ar0, Ar0, 4 )
    ADDI( Br0, Br0, 4 )
    ADDI( Ai0, Ai0, 4 )
    ADDI( Bi0, Bi0, 4 )
    BNE( SUFFIX( post_loop ) )
/**
  Write out result
**/
LABEL( SUFFIX( scaler_vmx_combine ) )
    FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
    FADDS( fisum, fisumi, fisumr )
    FADDS( frsum, frsum, rsum_vmx )
    FADDS( fisum, fisum, isum_vmx )
    STFS( frsum, Cr, 0 )
    STFS( fisum, Ci, 0 )
/**
  return
**/
LABEL( SUFFIX( ret ) )
    FREE_THRU_v19( VREGSAVE_COND )
    REST_r13_r30
    RETURN
FUNC_EPILOG
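
The control flow of the routine above (scalar points up to vector alignment, a vector body that keeps four independent partial sums, a scalar tail, then a final combine) may be easier to follow in plain C. The sketch below is only an illustration under stated assumptions: the function name `zdotpr_blocked` is hypothetical, it handles unit stride only, four-element `float` arrays stand in for the VMX registers, and the vector body takes four elements per pass rather than the sixteen-per-trip unrolled double loop with 8- and 4-element cleanup used in the listing. As in the listing, alignment is tested on the real part of A only, and the subtraction and addition that form the real and imaginary results are deferred to the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unit-stride sketch of the peel / vector / tail structure. */
void zdotpr_blocked(const float *ar, const float *ai,
                    const float *br, const float *bi,
                    float *cr, float *ci, size_t n)
{
    /* Scalar partial sums (cf. frsumr, frsumi, fisumi, fisumr). */
    float frr = 0.0f, fii = 0.0f, fri = 0.0f, fir = 0.0f;
    /* Four-lane partial sums (cf. rsumr, rsumi, isumi, isumr). */
    float vrr[4] = {0}, vii[4] = {0}, vri[4] = {0}, vir[4] = {0};
    size_t m = 0;

    assert(cr && ci);

    /* Leading points until the real part of A is 16-byte aligned. */
    while (m < n && ((uintptr_t)(ar + m) & 15u) != 0) {
        frr += ar[m] * br[m];
        fii += ai[m] * bi[m];
        fri += ar[m] * bi[m];
        fir += ai[m] * br[m];
        ++m;
    }

    /* Vector body: one 4-float "register" per array per pass. */
    for (; m + 4 <= n; m += 4) {
        for (int k = 0; k < 4; ++k) {
            vrr[k] += ar[m + k] * br[m + k];
            vii[k] += ai[m + k] * bi[m + k];
            vri[k] += ar[m + k] * bi[m + k];
            vir[k] += ai[m + k] * br[m + k];
        }
    }

    /* Trailing 1 to 3 points. */
    for (; m < n; ++m) {
        frr += ar[m] * br[m];
        fii += ai[m] * bi[m];
        fri += ar[m] * bi[m];
        fir += ai[m] * br[m];
    }

    /* Fold the lanes into the scalar sums, then combine. */
    for (int k = 0; k < 4; ++k) {
        frr += vrr[k];
        fii += vii[k];
        fri += vri[k];
        fir += vir[k];
    }
    *cr = frr - fii;   /* real part      */
    *ci = fri + fir;   /* imaginary part */
}
```

Because the partial sums are accumulated in a different order than a straight left-to-right loop, results for general floating-point inputs can differ from a naive reference in the last bits.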



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MC Standard Algorithms -- PPC Macro language Version 



File Name:    ZDOTPR.MAC

Description:  Vector Single Precision Complex Dot Product
Entry/params: ZDOTPR (A, I, B, J, C, N)
Formula:      C[0] = sum (A->realp[mI] * B->realp[mJ]
                        - A->imagp[mI] * B->imagp[mJ])
              C[1] = sum (A->realp[mI] * B->imagp[mJ]
                        + A->imagp[mI] * B->realp[mJ])
              for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1998 All rights reserved

Revision  Date    Engineer  Reason
0.0       981209  fpl       Created (from cdotpr.mac)
0.1       990310  fpl       750/G4 integration
0.1       990322  fpl       Stylistic changes



#define COMPILE_ESAL_JUMP_TABLE 

#define FUNC_TYPE ZDOTPR 

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE ) || defined( COMPILE_NO_ESAL_JUMP_TABLE )

/* 1 variant: _zdotpr_vmx() */

#define VMX_SAL
#include "zdotpr_vmx.k"

#else

/* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr_vmx.k"
#undef VMX_CC

#endif /* end COMPILE_ESAL_JUMP_TABLE */

#endif /* end BUILD_MAX */
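
The file above builds several entry points from one kernel source by redefining a few macros and including the kernel once per variant, with a token-pasted suffix keeping the names distinct. A single-file C sketch of the same idea follows. Everything in it is hypothetical and chosen for illustration: the names `DEFINE_DOT`, `TOUCH`, `NO_TOUCH`, `touch_count`, `dot_table` and `dot_dispatch`, the real-valued kernel, the choice of which variant hints which operand, and the two-bit flag encoding. The `TOUCH` hook merely counts calls; it stands in for a prefetch hint and does not model the DST instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a prefetch hint: counts how often it was requested. */
unsigned long touch_count = 0;
#define TOUCH(p)    (touch_count++, (void)(p))
#define NO_TOUCH(p) ((void)(p))

/* One kernel body, instantiated once per variant. The SUFFIX argument is
 * pasted onto the function name, as SUFFIX( label ) does for labels. */
#define DEFINE_DOT(SUFFIX, HINT_A, HINT_B)                       \
    float dot##SUFFIX(const float *a, const float *b, size_t n) \
    {                                                            \
        float s = 0.0f;                                          \
        for (size_t m = 0; m < n; ++m) {                         \
            HINT_A(a + m);                                       \
            HINT_B(b + m);                                       \
            s += a[m] * b[m];                                    \
        }                                                        \
        return s;                                                \
    }

DEFINE_DOT(_nn, NO_TOUCH, NO_TOUCH)
DEFINE_DOT(_nc, TOUCH,    NO_TOUCH)
DEFINE_DOT(_cn, NO_TOUCH, TOUCH)
DEFINE_DOT(_cc, NO_TOUCH, NO_TOUCH)

/* Jump table indexed by a hypothetical two-bit flag. */
typedef float (*dot_fn)(const float *, const float *, size_t);
static const dot_fn dot_table[4] = { dot_nn, dot_nc, dot_cn, dot_cc };

float dot_dispatch(unsigned flag, const float *a, const float *b, size_t n)
{
    assert(a && b);
    return dot_table[flag & 3u](a, b, n);
}
```

All four variants compute the same result; only the hinting differs, so a caller that knows where its operands reside can pick the variant through the flag without any branch inside the kernel.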
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